Preface



This is a book about algorithms for performing arithmetic, and their imple- 
mentation on modern computers. We are concerned with software more than 
hardware — we do not cover computer architecture or the design of computer 
hardware since good books are already available on these topics. Instead we 
focus on algorithms for efficiently performing arithmetic operations such as 
addition, multiplication and division, and their connections to topics such as 
modular arithmetic, greatest common divisors, the Fast Fourier Transform 
(FFT), and the computation of special functions. 

The algorithms that we present are mainly intended for arbitrary-precision 
arithmetic. That is, they are not limited by the computer wordsize of 32 or 
64 bits, only by the memory and time available for the computation. We 
consider both integer and real (floating-point) computations. 

The book is divided into four main chapters, plus an appendix. Chapter 1
covers integer arithmetic. This has, of course, been considered in many other 
books and papers. However, there has been much recent progress, inspired in 
part by the application to public key cryptography, so most of the published 
books are now partly out of date or incomplete. Our aim has been to present 
the latest developments in a concise manner. 

Chapter 2 is concerned with the FFT and modular arithmetic, and their
applications to computer arithmetic. We consider different number represen- 
tations, fast algorithms for multiplication, division and exponentiation, and 
the use of the Chinese Remainder Theorem (CRT). 

Chapter 3 covers floating-point arithmetic. Our concern is with high-
precision floating-point arithmetic, implemented in software if the precision
provided by the hardware (typically IEEE standard 64-bit arithmetic) is in- 
adequate. The algorithms described in this chapter focus on correct rounding, 
extending the IEEE standard to arbitrary precision. 

Chapter 4 deals with the computation, to arbitrary precision, of functions
such as sqrt, exp, ln, sin, cos, and more generally functions defined by power
series or continued fractions. We also consider the computation of certain
constants, such as $\pi$ and (Euler's constant) $\gamma$. Of course, the computation of
special functions is a huge topic so we have had to be selective. In particular, 
we have concentrated on methods that are efficient and suitable for arbitrary- 
precision computations. 

For details that are omitted we give pointers in the Notes and References 
sections of each chapter, and in the bibliography. Finally, the Appendix 
contains pointers to implementations, useful web sites, mailing lists, and so 
on. 

The book is intended for anyone interested in the design and implemen- 
tation of efficient algorithms for computer arithmetic, and more generally 
efficient numerical algorithms. We did our best to present algorithms that 
are ready to implement in your favorite language, while keeping a high-level 
description. 

Although the book is not specifically intended as a textbook, it could be 
used in a graduate course in mathematics or computer science, and for this 
reason, as well as to cover topics that could not be discussed at length in the 
text, we have included exercises at the end of each chapter. For solutions to 
the exercises, please contact the authors. 

We thank the French National Institute for Research in Computer Science 
and Control (INRIA), the Australian National University (ANU), and the 
Australian Research Council (ARC), for their support. The book could not 
have been written without the contributions of many friends and colleagues, 
too numerous to mention here, but acknowledged in the text and in the Notes 
and References sections at the end of each chapter. 

Finally, we acknowledge Erin Brent, who first suggested writing the book; 
and thank our wives, Judy-anne and Marie, for their patience and encour- 
agement. 

This is a preliminary version — there are still a few exercises to be added. 
We welcome comments and corrections. Please send them to either of the 
authors. 

Richard Brent and Paul Zimmermann 

MCA@rpbrent.com

Paul.Zimmermann@inria.fr

Canberra and Nancy, June 2009 
Notation



$\mathbb{C}$   set of complex numbers

$\widehat{\mathbb{C}}$   set of extended complex numbers $\mathbb{C} \cup \{\infty\}$

$\mathbb{N}$   set of natural numbers (nonnegative integers)

$\mathbb{N}^*$   set of positive integers $\mathbb{N} \setminus \{0\}$

$\mathbb{Q}$   set of rational numbers

$\mathbb{R}$   set of real numbers

$\mathbb{Z}$   set of integers

$\mathbb{Z}/n\mathbb{Z}$   ring of residues modulo $n$

$C^n$   set of (real or complex) functions with $n$ continuous derivatives in the region of interest



$\Re(z)$   real part of a complex number $z$

$\Im(z)$   imaginary part of a complex number $z$

$\bar{z}$   conjugate of the complex number $z$

$|z|$   Euclidean norm of the complex number $z$



$B_n$   Bernoulli numbers, $\sum_{n \ge 0} B_n z^n/n! = z/(e^z - 1)$

$C_n$   scaled Bernoulli numbers, $C_n = B_{2n}/(2n)!$, $\sum C_n z^{2n} = (z/2)/\tanh(z/2)$

$T_n$   tangent numbers, $\sum T_n z^{2n-1}/(2n-1)! = \tan z$

$H_n$   harmonic number $\sum_{j=1}^{n} 1/j$ (0 if $n \le 0$)

$\binom{n}{k}$   binomial coefficient "n choose k" $= \frac{n!}{k!\,(n-k)!}$ (0 if $k < 0$ or $k > n$)

$\beta$   "word" base (usually $2^{32}$ or $2^{64}$)

$n$   "precision": number of base $\beta$ digits in an integer or in a floating-point significand, or a free variable, depending on the context

$\varepsilon$   "machine precision": $\frac{1}{2}\beta^{1-n}$

$\eta$   smallest positive subnormal number

$\circ(x)$   rounding of real number $x$

$\mathrm{ulp}(x)$   for a floating-point number $x$, one unit in the last place

$M(n)$   time to multiply $n$-bit integers or polynomials of degree $n - 1$, depending on the context

$M(m,n)$   time to multiply an $m$-bit integer by an $n$-bit integer

$D(n)$   time to divide a $2n$-bit integer by an $n$-bit integer, giving quotient and remainder

$D(m,n)$   time to divide an $m$-bit integer by an $n$-bit integer, giving quotient and remainder

$\mathrm{sign}(n)$   $+1$ if $n > 0$, $-1$ if $n < 0$, and $0$ if $n = 0$

$r = a \bmod b$   integer remainder ($0 \le r < b$)

$q = a \,\mathrm{div}\, b$   integer quotient ($0 \le a - qb < b$)

$(a, b)$   greatest common divisor of $a$ and $b$

$i \wedge j$   bitwise and of integers $i$ and $j$, or logical and of two Boolean expressions

$i \oplus j$   bitwise exclusive-or of integers $i$ and $j$

$i \ll k$   integer $i$ multiplied by $2^k$

$i \gg k$   quotient of division of integer $i$ by $2^k$

$(m, n)$   greatest common divisor of integers $m$ and $n$

$\nu(n)$   2-valuation: largest $k$ such that $2^k$ divides $n$ ($\nu(0) = \infty$)

$\sigma(e)$   length of the shortest addition chain to compute $e$

$\phi(n)$   Euler's totient function, $\#\{m : 0 < m \le n \wedge (m, n) = 1\}$

$\deg(A)$   for a polynomial $A$, the degree of $A$

$\mathrm{ord}(A)$   for a power series $A = \sum_j a_j z^j$, $\mathrm{ord}(A) = \min\{j : a_j \ne 0\}$ (note the special case $\mathrm{ord}(0) = +\infty$)

$\exp(x)$ or $e^x$   exponential function

$\ln(x)$   natural logarithm

$\log(x)$   natural logarithm, or logarithm to any fixed base

$\log_2(x)$, $\lg(x)$   base-2 logarithm

$\mathrm{nbits}(n)$   $\lfloor \lg(n) \rfloor + 1$ if $n > 0$, $0$ if $n = 0$

${}^t[a, b]$ or $[a, b]^t$   column vector $\begin{pmatrix} a \\ b \end{pmatrix}$

$[a, b; c, d]$   $2 \times 2$ matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$

$f(n) = O(g(n))$   $\exists c, n_0$ such that $|f(n)| \le c\,g(n)$ for all $n \ge n_0$

$f(n) = \Theta(g(n))$   $f(n) = O(g(n))$ and $g(n) = O(f(n))$

$f(n) \sim g(n)$   $f(n)/g(n) \to 1$ as $n \to \infty$

$f(n) = o(g(n))$   $f(n)/g(n) \to 0$ as $n \to \infty$

$f(n) \ll g(n)$   $f(n) = O(g(n))$; suggests that the implied constant $c$ is small

$f(n) \gg g(n)$   $g(n) \ll f(n)$

$f(x) \sim \sum_0^n a_j/x^j$   $f(x) - \sum_0^n a_j/x^j = O(1/x^{n+1})$ as $x \to +\infty$
$xxx.yyy_\rho$   a number xxx.yyy written in base $\rho$; for example, the decimal number 3.25 is $11.01_2$ in binary

$1/2\pi$   $1/(2\pi)$ (multiplication has higher precedence than division)

$\frac{a}{b+}\,\frac{c}{d+}\,\frac{e}{f+}\cdots$   continued fraction $a/(b + c/(d + e/(f + \cdots)))$

$|z|$   absolute value of a scalar $z$

$|A|$   determinant of a matrix $A$, for example: $\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$
Chapter 1
Integer Arithmetic 



In this Chapter our main topic is integer arithmetic. However, 
we shall see that many algorithms for polynomial arithmetic are 
similar to the corresponding algorithms for integer arithmetic, 
but simpler due to the lack of carries in polynomial arithmetic. 
Consider for example addition: the sum of two polynomials of 
degree n always has degree n at most, whereas the sum of two 
n-digit integers may have n + 1 digits. Thus we often describe 
algorithms for polynomials as an aid to understanding the corre- 
sponding algorithms for integers. 



1.1 Representation and Notations 

We consider in this Chapter algorithms working on integers. We shall distin- 
guish between the logical — or mathematical — representation of an integer, 
and its physical representation on a computer. Our algorithms are intended 
for "large" integers -- they are not restricted to integers that can be repre- 
sented in a single computer word. 

Several physical representations are possible. We consider here only the
most common one, namely a dense representation in a fixed base. Choose
an integral base $\beta > 1$. (In case of ambiguity, $\beta$ will be called the internal
base.) A positive integer $A$ is represented by the length $n$ and the digits $a_i$
of its base $\beta$ expansion:

$$A = a_{n-1}\beta^{n-1} + \cdots + a_1\beta + a_0,$$

where $0 \le a_i \le \beta - 1$, and $a_{n-1}$ is sometimes assumed to be non-zero.
Since the base $\beta$ is usually fixed in a given program, only the length $n$
and the integers $(a_i)_{0 \le i < n}$ need to be stored. Some common choices for $\beta$
are $2^{32}$ on a 32-bit computer, or $2^{64}$ on a 64-bit machine; other possible
choices are respectively $10^9$ and $10^{19}$ for a decimal representation, or $2^{53}$
when using double precision floating-point registers. Most algorithms given
in this Chapter work in any base; the exceptions are explicitly mentioned.
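As an illustration of the dense representation just described, here is a minimal Python sketch. The function names `to_digits` and `from_digits` are hypothetical, and the digits are stored least significant first, which is an assumption of this sketch rather than a convention fixed by the text.

```python
def to_digits(a, beta):
    """Return the base-beta digits of a >= 0, least significant first."""
    digits = []
    while a > 0:
        a, r = divmod(a, beta)
        digits.append(r)
    return digits  # the empty list (length n = 0) represents zero


def from_digits(digits, beta):
    """Evaluate sum(digits[i] * beta**i) using Horner's rule."""
    a = 0
    for d in reversed(digits):
        a = a * beta + d
    return a
```

With $\beta = 10^9$, the integer 123456789012 is stored as the two digits 456789012 and 123.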

We assume that the sign is stored separately from the absolute value,
which is known as the "sign-magnitude" representation. Zero is an important
special case; to simplify the algorithms we assume that $n = 0$ if $A = 0$, and
in most cases we assume that this case is treated separately.

Except when explicitly mentioned, we assume that all operations are off- 
line, i.e., all inputs (resp. outputs) are completely known at the beginning 
(resp. end) of the algorithm. Different models include on-line — also called 
lazy — algorithms, and relaxed algorithms. 

1.2 Addition and Subtraction 

As an explanatory example, here is an algorithm for integer addition. In the 
algorithm, d is a carry bit. 

Our algorithms are given in a language which mixes mathematical nota- 
tion and syntax similar to that found in many high-level computer languages. 
It should be straightforward to translate into a language such as C. The line 
numbers are included in case we need to refer to individual lines in the de- 
scription or analysis of the algorithm. 

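Algorithm 1 Integer Addition

Input: $A = \sum_0^{n-1} a_i\beta^i$, $B = \sum_0^{n-1} b_i\beta^i$, carry-in $0 \le d_{in} \le 1$

Output: $C := \sum_0^{n-1} c_i\beta^i$ and $0 \le d \le 1$ such that $A + B + d_{in} = d\beta^n + C$

1: $d \leftarrow d_{in}$
2: for $i$ from $0$ to $n - 1$ do
3:     $s \leftarrow a_i + b_i + d$
4:     $c_i \leftarrow s \bmod \beta$
5:     $d \leftarrow s \,\mathrm{div}\, \beta$
6: return $C$, $d$.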
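The following is a direct Python transcription of Algorithm 1, given as a sketch only. The name `add_digits` and the list-based digit storage (least significant digit first) are assumptions of the sketch. Python integers never overflow, so the representability issue for $s$ discussed next does not show up here.

```python
def add_digits(a, b, beta, d_in=0):
    """Add two n-digit numbers given as digit lists, plus a carry-in.

    Returns (c, d) such that A + B + d_in == d * beta**n + C.
    """
    assert len(a) == len(b) and d_in in (0, 1)
    c = []
    d = d_in
    for ai, bi in zip(a, b):
        s = ai + bi + d      # step 3: at most 2*beta - 1
        c.append(s % beta)   # step 4: c_i <- s mod beta
        d = s // beta        # step 5: d <- s div beta, always 0 or 1
    return c, d
```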



Let $T$ be the number of different values taken by the data type representing
the coefficients $a_i$, $b_i$. (Clearly $\beta \le T$ but equality does not necessarily
hold, e.g., $\beta = 10^9$ and $T = 2^{32}$.) At step 3 the value of $s$ can be as large as
$2\beta - 1$, which is not representable if $\beta = T$. Several workarounds are possible:
either use a machine instruction that gives the possible carry of $a_i + b_i$; or
use the fact that, if a carry occurs in $a_i + b_i$, then the computed sum (if
performed modulo $T$) equals $t := a_i + b_i - T < a_i$; thus comparing $t$ and
$a_i$ will determine if a carry occurred. A third solution is to keep a bit in
reserve, taking $\beta \le \lfloor T/2 \rfloor$.
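The comparison trick can be simulated in Python by reducing the sum modulo $T$ explicitly. This is only a sketch: the name `add_mod_T` and the default $T = 2^{64}$ are assumptions, and on real hardware the reduction is implicit in the wrap-around of unsigned word addition.

```python
def add_mod_T(ai, bi, T=2**64):
    """Return (t, carry) for the sum of two words 0 <= ai, bi < T."""
    t = (ai + bi) % T            # what a T-valued word would actually hold
    carry = 1 if t < ai else 0   # t < ai exactly when the sum wrapped around
    return t, carry
```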

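The subtraction code is very similar. Step 3 simply becomes $s \leftarrow a_i - b_i - d$,
where $d \in \{0, 1\}$ is the borrow of the subtraction, and $-\beta \le s < \beta$.
Step 4 is unchanged (with $s \bmod \beta$ taken in $[0, \beta)$), and step 5 sets $d \leftarrow 1$
if $s < 0$ and $d \leftarrow 0$ otherwise, which gives the invariant $A - B - d_{in} = -d\beta^n + C$.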

Addition and subtraction of $n$-word integers costs $O(n)$, which is negligible
compared to the multiplication cost. However, it is worth trying to
reduce the constant factor implicit in this $O(n)$ cost; indeed, we shall see in
§1.3 that "fast" multiplication algorithms are obtained by replacing
multiplications by additions (usually more additions than the multiplications that
they replace). Thus, the faster the additions are, the smaller the thresholds
for changing over to the "fast" algorithms will be.



1.3 Multiplication 

A nice application of large integer multiplication is the Kronecker-Schönhage
trick, also called segmentation or substitution by some authors. Assume we
want to multiply two polynomials $A(x)$ and $B(x)$ with non-negative integer
coefficients (see Ex. 1.1 for negative coefficients). Assume both polynomials
have degree less than $n$, and coefficients are bounded by $\rho$. Now take a power
$X = \beta^k$ of the base $\beta$, $n\rho^2 < X$, and multiply the integers $a = A(X)$ and
$b = B(X)$ obtained by evaluating $A$ and $B$ at $x = X$. If $C(x) = A(x)B(x) = \sum c_i x^i$,
we clearly have $C(X) = \sum c_i X^i$. Now since the $c_i$ are bounded by
$n\rho^2 < X$, the coefficients $c_i$ can be retrieved by simply "reading" blocks of $k$
words in $C(X)$. Assume for example one wants to compute

$$(6x^5 + 6x^4 + 4x^3 + 9x^2 + x + 3)(7x^4 + x^3 + 2x^2 + x + 7),$$

with degree less than $n = 6$, and coefficients bounded by $\rho = 9$. One can
thus take $X = 10^3 > n\rho^2$, and perform the integer multiplication:

$$6006004009001003 \times 7001002001007 = 42048046085072086042070010021,$$

from which one can read the product $42x^9 + 48x^8 + 46x^7 + 85x^6 + 72x^5 +
86x^4 + 42x^3 + 70x^2 + 10x + 21$.
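The worked example can be reproduced with a short Python sketch. The function name `kronecker_multiply` is hypothetical, coefficient lists are stored lowest degree first, and $X$ is chosen as a power of 10 (that is, $\beta = 10$) purely to match the decimal example.

```python
def kronecker_multiply(a, b, rho):
    """Multiply polynomials with coefficients in [0, rho] via one integer product."""
    n = max(len(a), len(b))          # both polynomials have degree < n
    X = 10
    while X <= n * rho * rho:        # smallest power of 10 with n*rho^2 < X
        X *= 10
    A = sum(c * X**i for i, c in enumerate(a))   # A(X)
    B = sum(c * X**i for i, c in enumerate(b))   # B(X)
    C = A * B                                    # a single large multiplication
    coeffs = []
    for _ in range(len(a) + len(b) - 1):
        C, c = divmod(C, X)          # "read" one block of digits
        coeffs.append(c)
    return coeffs
```

For the two polynomials above, `kronecker_multiply([3, 1, 9, 4, 6, 6], [7, 1, 2, 1, 7], 9)` picks $X = 10^3$ and returns the coefficients 21, 10, 70, 42, 86, 72, 85, 46, 48, 42, from the constant term upwards.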

Conversely, suppose we want to multiply two integers $a = \sum_{0 \le i < n} a_i\beta^i$
and $b = \sum_{0 \le j < n} b_j\beta^j$. Multiply the polynomials $A(x) = \sum_{0 \le i < n} a_i x^i$ and
$B(x) = \sum_{0 \le j < n} b_j x^j$, obtaining a polynomial $C(x)$, then evaluate $C(x)$ at
$x = \beta$ to obtain $ab$. Note that the coefficients of $C(x)$ may be larger than $\beta$,
in fact they may be up to about $n\beta^2$. For example with $a = 123$ and $b = 456$
with $\beta = 10$, we obtain $A(x) = x^2 + 2x + 3$, $B(x) = 4x^2 + 5x + 6$, whose
product is $C(x) = 4x^4 + 13x^3 + 28x^2 + 27x + 18$, and $C(10) = 56088$. These
examples demonstrate the analogy between operations on polynomials and
integers, and also show the limits of the analogy.
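The converse direction can be checked in a few lines of Python. This is a sketch with hypothetical helper names `poly_mul` and `eval_poly`; the coefficients of $C(x)$ are left unreduced, so they may exceed $\beta$ exactly as noted above.

```python
def poly_mul(a, b):
    """Schoolbook product of coefficient lists (lowest degree first), no carries."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def eval_poly(c, x):
    """Evaluate the polynomial with coefficient list c at x (Horner's rule)."""
    r = 0
    for ck in reversed(c):
        r = r * x + ck
    return r
```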

A common and very useful notation is to let $M(n)$ denote the time to multiply
$n$-bit integers, or polynomials of degree $n - 1$, depending on the context.
In the polynomial case, we assume that the cost of multiplying coefficients is 
constant; this is known as the arithmetic complexity model, whereas the bit 
complexity model also takes into account the cost of multiplying coefficients, 
and thus their bit-size. 



1.3.1 Naive Multiplication 



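Algorithm 2 BasecaseMultiply

Input: $A = \sum_0^{m-1} a_i\beta^i$, $B = \sum_0^{n-1} b_j\beta^j$

Output: $C = AB := \sum_0^{m+n-1} c_k\beta^k$

1: $C \leftarrow A \cdot b_0$
2: for $j$ from $1$ to $n - 1$ do
3:     $C \leftarrow C + \beta^j (A \cdot b_j)$
4: return $C$.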



Theorem 1.3.1 Algorithm BasecaseMultiply computes the product $AB$
correctly, and uses $\Theta(mn)$ word operations.

The multiplication by $\beta^j$ at step 3 is trivial with the chosen dense
representation: it simply requires shifting by $j$ words towards the most significant
words. The main operation in algorithm BasecaseMultiply is the computation
of $A \cdot b_j$ and its accumulation into $C$ at step 3. Since all fast algorithms
rely on multiplication, the most important operation to optimize in
multiple-precision software is thus the multiplication of an array of $m$ words by one
word, with accumulation of the result in another array of $m + 1$ words.

Since multiplication with accumulation usually makes extensive use of the
pipeline, it is best to give it arrays that are as long as possible, which means
that $A$ rather than $B$ should be the operand of larger size (i.e., $m \ge n$).
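A possible Python rendering of BasecaseMultiply is sketched below, with the inner loop playing the role of the multiply-and-accumulate operation on an array of $m$ words. The name `basecase_multiply` and the digit-list layout (least significant word first) are assumptions of this sketch, not the text's own code.

```python
def basecase_multiply(a, b, beta):
    """Schoolbook product of digit lists a (m words) and b (n words)."""
    m, n = len(a), len(b)
    c = [0] * (m + n)
    for j, bj in enumerate(b):
        carry = 0
        # accumulate A * b_j into C, shifted by j words
        for i, ai in enumerate(a):
            carry, c[i + j] = divmod(c[i + j] + ai * bj + carry, beta)
        c[j + m] = carry      # word j + m has not been written before
    return c
```

Each intermediate value `c[i + j] + ai * bj + carry` is at most $\beta^2 - 1$, so the carry always fits in one word.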

1.3.2 Karatsuba's Algorithm 

In the following, $n_0 \ge 2$ denotes the threshold between naive multiplication
and Karatsuba's algorithm, which is used for $n_0$-word and larger inputs. The
optimal "Karatsuba threshold" $n_0$ can vary from 10 to 100 words depending
on the processor, and the relative efficiency of the word multiplication and
addition (see Ex. 1.5).

Algorithm 3 KaratsubaMultiply.

Input: $A = \sum_0^{n-1} a_i\beta^i$, $B = \sum_0^{n-1} b_j\beta^j$
Output: $C = AB := \sum_0^{2n-1} c_k\beta^k$

If $n < n_0$ then return BasecaseMultiply$(A, B)$

$k \leftarrow \lceil n/2 \rceil$

$(A_0, B_0) := (A, B) \bmod \beta^k$, $(A_1, B_1) := (A, B) \,\mathrm{div}\, \beta^k$

$s_A \leftarrow \mathrm{sign}(A_0 - A_1)$, $s_B \leftarrow \mathrm{sign}(B_0 - B_1)$

$C_0 \leftarrow$ KaratsubaMultiply$(A_0, B_0)$

$C_1 \leftarrow$ KaratsubaMultiply$(A_1, B_1)$

$C_2 \leftarrow$ KaratsubaMultiply$(|A_0 - A_1|, |B_0 - B_1|)$

return $C := C_0 + (C_0 + C_1 - s_A s_B C_2)\beta^k + C_1\beta^{2k}$.
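The recursion can be sketched in Python as follows. This is an illustration under stated assumptions: Python integers stand in for arrays of words, the built-in product replaces BasecaseMultiply below the threshold, and the names `karatsuba` and `sign` as well as the default values of `beta` and `n0` are hypothetical.

```python
def sign(x):
    """+1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)


def karatsuba(a, b, n, beta=2**64, n0=2):
    """Multiply 0 <= a, b < beta**n with the subtractive Karatsuba scheme."""
    if n < n0:
        return a * b                   # stands in for BasecaseMultiply
    k = (n + 1) // 2                   # k = ceil(n/2)
    X = beta ** k
    a1, a0 = divmod(a, X)              # A = A0 + A1 * beta^k
    b1, b0 = divmod(b, X)
    sa, sb = sign(a0 - a1), sign(b0 - b1)
    c0 = karatsuba(a0, b0, k, beta, n0)
    c1 = karatsuba(a1, b1, n - k, beta, n0)     # high parts have floor(n/2) words
    c2 = karatsuba(abs(a0 - a1), abs(b0 - b1), k, beta, n0)
    return c0 + (c0 + c1 - sa * sb * c2) * X + c1 * X * X
```

With $\beta = 10$ and $n = 3$, `karatsuba(123, 456, 3, beta=10)` splits the operands as $23 + 1 \cdot 10^2$ and $56 + 4 \cdot 10^2$ and returns 56088.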



Theorem 1.3.2 Algorithm KaratsubaMultiply computes the product $AB$
correctly, using $K(n) = O(n^\alpha)$ word multiplications, with $\alpha = \log_2 3 \approx 1.585$.

Proof. Since $s_A|A_0 - A_1| = A_0 - A_1$, and similarly for $B$,
$s_A s_B |A_0 - A_1||B_0 - B_1| = (A_0 - A_1)(B_0 - B_1)$, thus
$C = A_0B_0 + (A_0B_1 + A_1B_0)\beta^k + A_1B_1\beta^{2k}$.
Since $A_0$, $B_0$, $|A_0 - A_1|$ and $|B_0 - B_1|$ have (at most) $\lceil n/2 \rceil$ words, and $A_1$
and $B_1$ have (at most) $\lfloor n/2 \rfloor$ words, the number $K(n)$ of word multiplications
satisfies the recurrence $K(n) = n^2$ for $n < n_0$, and $K(n) = 2K(\lceil n/2 \rceil) +
K(\lfloor n/2 \rfloor)$ for $n \ge n_0$. Assume $2^{\ell-1}n_0 < n \le 2^\ell n_0$ with $\ell \ge 1$, then $K(n)$
is the sum of three $K(j)$ values with $j \le 2^{\ell-1}n_0$, ..., thus of $3^\ell$ $K(j)$ with
$j \le n_0$. Thus $K(n) \le 3^\ell \max(K(n_0), (n_0 - 1)^2)$, which gives $K(n) \le Cn^\alpha$
with $C = 3^{1 - \log_2 n_0} \max(K(n_0), (n_0 - 1)^2)$. □

Modern Computer Arithmetic, version 0.3 of June 10, 2009

Different variants of Karatsuba's algorithm exist; this variant is known as
the subtractive version. Another classical one is the additive version, which
uses $A_0 + A_1$ and $B_0 + B_1$ instead of $|A_0 - A_1|$ and $|B_0 - B_1|$. However, the
subtractive version is more convenient for integer arithmetic, since it avoids
the possible carries in $A_0 + A_1$ and $B_0 + B_1$, which require either an extra
word in those sums, or extra additions.

The efficiency of an implementation of Karatsuba's algorithm depends
heavily on memory usage. It is quite important to avoid allocating memory
for the intermediate results $|A_0 - A_1|$, $|B_0 - B_1|$, $C_0$, $C_1$, and $C_2$ at each step
(although modern compilers are quite good at optimising code and removing
unnecessary memory references). One possible solution is to allow a large
temporary storage of $m$ words, used both for the intermediate results and
for the recursive calls. It can be shown that an auxiliary space of $m = 2n$
words, or even $m = n$ in the polynomial case, is sufficient (see Ex. 1.6).

Since the third product $C_2$ is used only once, it may be faster to have
two auxiliary routines KaratsubaAddmul and KaratsubaSubmul that 
accumulate their result, calling themselves recursively, together with Karat- 
subaMultiply (see Ex.O]). 

The above version uses $\sim 4n$ additions (or subtractions): $2 \times \frac{n}{2}$ to compute
$|A_0 - A_1|$ and $|B_0 - B_1|$, then $n$ to add $C_0$ and $C_1$, again $n$ to add or subtract
$C_2$, and $n$ to add $(C_0 + C_1 - s_A s_B C_2)\beta^k$ to $C_0 + C_1\beta^{2k}$. An improved scheme
uses only $\sim \frac{7}{2}n$ additions (see Ex. 1.7).

Most fast multiplication algorithms can be viewed as evaluation/interpolation
algorithms, from a polynomial point of view. Karatsuba's algorithm regards
the inputs as polynomials $A_0 + A_1x$ and $B_0 + B_1x$ evaluated at $x = \beta^k$; since
their product $C(x)$ is of degree 2, Lagrange's interpolation theorem says that
it is sufficient to evaluate it at three points. The subtractive version evaluates
$C(x)$ at $x = 0, -1, \infty$, whereas the additive version uses $x = 0, +1, \infty$.¹

1.3.3 Toom-Cook Multiplication 

Karatsuba's idea readily generalizes to what is known as Toom-Cook r-way
multiplication. Write the inputs as $a_0 + \cdots + a_{r-1}x^{r-1}$ and $b_0 + \cdots + b_{r-1}x^{r-1}$,
with $x = \beta^k$, and $k = \lceil n/r \rceil$. Since their product $C(x)$ is of degree $2r - 2$,
it suffices to evaluate it at $2r - 1$ distinct points to be able to recover $C(x)$,
and in particular $C(\beta^k)$.

¹ Evaluating $C(x)$ at $\infty$ means computing the product $A_1B_1$ of the leading coefficients.

Most references, when describing subquadratic multiplication algorithms, 
only describe Karatsuba and FFT-based algorithms. Nevertheless, the Toom- 
Cook algorithm is quite interesting in practice. 

Toom-Cook r-way reduces one $n$-word product to $2r - 1$ products of $\lceil n/r \rceil$
words. This gives an asymptotic complexity of $O(n^\nu)$ with $\nu = \frac{\log(2r-1)}{\log r}$.
However, the constant hidden by the big-O notation depends strongly on
the evaluation and interpolation formulae, which in turn depend on the chosen
points. One possibility is to take $-(r-1), \ldots, -1, 0, 1, \ldots, (r-1)$ as
evaluation points.

The case $r = 2$ corresponds to Karatsuba's algorithm (§1.3.2). The
case $r = 3$ is known as Toom-Cook 3-way, sometimes simply called "the
Toom-Cook algorithm". The following algorithm uses evaluation points
$0, 1, -1, 2, \infty$, and tries to optimize the evaluation and interpolation formulae.

Algorithm 4 ToomCook3

Input: two integers $0 \le A, B < \beta^n$.

Output: $AB := c_0 + c_1\beta^k + c_2\beta^{2k} + c_3\beta^{3k} + c_4\beta^{4k}$ with $k = \lceil n/3 \rceil$.

1: If $n < 3$ then return KaratsubaMultiply$(A, B)$

2: Write $A = a_0 + a_1x + a_2x^2$, $B = b_0 + b_1x + b_2x^2$ with $x = \beta^k$.

3: $v_0 \leftarrow$ ToomCook3$(a_0, b_0)$

4: $v_1 \leftarrow$ ToomCook3$(a_{02} + a_1, b_{02} + b_1)$ where $a_{02} \leftarrow a_0 + a_2$, $b_{02} \leftarrow b_0 + b_2$

5: $v_{-1} \leftarrow$ ToomCook3$(a_{02} - a_1, b_{02} - b_1)$

6: $v_2 \leftarrow$ ToomCook3$(a_0 + 2a_1 + 4a_2, b_0 + 2b_1 + 4b_2)$

7: $v_\infty \leftarrow$ ToomCook3$(a_2, b_2)$

8: $t_1 \leftarrow (3v_0 + 2v_{-1} + v_2)/6 - 2v_\infty$, $t_2 \leftarrow (v_1 + v_{-1})/2$

9: $c_0 \leftarrow v_0$, $c_1 \leftarrow v_1 - t_1$, $c_2 \leftarrow t_2 - v_0 - v_\infty$, $c_3 \leftarrow t_1 - t_2$, $c_4 \leftarrow v_\infty$

The divisions at step 8 are exact; if $\beta$ is a power of two, the division by
6 can be done using a division by 2 (which consists of a single shift)
followed by a division by 3 (§1.4.7).
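The evaluation and interpolation formulae of ToomCook3 can be exercised with the Python sketch below. Several choices here are assumptions of the sketch and not part of the algorithm as stated: Python integers replace word arrays, the built-in product stands in for KaratsubaMultiply, the operand size is recomputed at each call, negative values arising at the point $-1$ are handled by an explicit sign, and $\beta \ge 8$ is required so that the evaluated operands are short enough for the recursion to terminate.

```python
def num_words(x, beta):
    """Number of base-beta words of x >= 0 (0 for x == 0)."""
    n = 0
    while x > 0:
        x //= beta
        n += 1
    return n


def toom3(a, b, beta=2**64):
    """Product of integers a and b using evaluation points 0, 1, -1, 2, oo."""
    assert beta >= 8
    if a < 0 or b < 0:
        s = -1 if (a < 0) != (b < 0) else 1
        return s * toom3(abs(a), abs(b), beta)
    n = max(num_words(a, beta), num_words(b, beta))
    if n < 3:
        return a * b                       # stands in for KaratsubaMultiply
    k = (n + 2) // 3                       # k = ceil(n/3)
    X = beta ** k
    a0, a1, a2 = a % X, (a // X) % X, a // X**2
    b0, b1, b2 = b % X, (b // X) % X, b // X**2
    a02, b02 = a0 + a2, b0 + b2
    v0 = toom3(a0, b0, beta)
    v1 = toom3(a02 + a1, b02 + b1, beta)
    vm1 = toom3(a02 - a1, b02 - b1, beta)
    v2 = toom3(a0 + 2 * a1 + 4 * a2, b0 + 2 * b1 + 4 * b2, beta)
    vinf = toom3(a2, b2, beta)
    t1 = (3 * v0 + 2 * vm1 + v2) // 6 - 2 * vinf   # exact division by 6
    t2 = (v1 + vm1) // 2                           # exact division by 2
    c0, c1, c2, c3, c4 = v0, v1 - t1, t2 - v0 - vinf, t1 - t2, vinf
    return c0 + c1 * X + c2 * X**2 + c3 * X**3 + c4 * X**4
```

For $A = 123$, $B = 456$ and $\beta = 10$ the five values are $v_0 = 18$, $v_1 = 90$, $v_{-1} = 10$, $v_2 = 352$, $v_\infty = 4$, giving $t_1 = 63$, $t_2 = 50$ and the coefficients $18, 27, 28, 13, 4$.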

For higher order Toom-Cook implementations see [181], which considers
the 4-way and 5-way variants, together with squaring. Toom-Cook r-way
has to invert a $(2r - 1) \times (2r - 1)$ Vandermonde matrix with parameters the
evaluation points; if one chooses consecutive integer points, the determinant
of that matrix contains all primes up to $2r - 2$. This proves that the division
by 3 cannot be avoided for Toom-Cook 3-way with consecutive integer points
(see Ex. 1.12 for a generalization of this result).

1.3.4 Fast Fourier Transform (FFT) 

Most subquadratic multiplication algorithms can be seen as evaluation-inter- 
polation algorithms. They mainly differ in the number of evaluation points, 
and the values of those points. However the evaluation and interpolation 
formulae become intricate in Toom-Cook r-way for large $r$, since they involve
$O(r^2)$ scalar operations. The Fast Fourier Transform (FFT) is a way to perform
evaluation and interpolation in an efficient way for some special points
(roots of unity) and special values of r. This explains why multiplication 
algorithms of the best asymptotic complexity are based on the Fast Fourier 
Transform. 

There are different flavours of FFT multiplication, depending on the ring
where the operations are performed. Schönhage-Strassen's algorithm [153],
with a complexity of $O(n \log n \log\log n)$, works in the ring $\mathbb{Z}/(2^n + 1)\mathbb{Z}$; since
it is based on modular computations, we describe it in Chapter 2.

Other commonly used algorithms work with floating-point complex numbers
[110, Section 4.3.3.C]; one drawback is that, due to the inexact nature of
floating-point computations, a careful error analysis is required to guarantee
the correctness of the implementation, assuming an underlying arithmetic
with rigorous error bounds (cf. Chapter 3).

We say that multiplication is in the FFT range if $n$ is large and the
multiplication algorithm satisfies $M(2n) \sim 2M(n)$. For example, this is
true if the Schönhage-Strassen multiplication algorithm is used, but not if
the classical algorithm or Karatsuba's algorithm is used.

1.3.5 Unbalanced Multiplication 

The subquadratic algorithms considered so far (Karatsuba and Toom-Cook) 
work with equal-size operands. How do we efficiently multiply integers of 
different sizes with a subquadratic algorithm? This case is important in 
practice but is rarely considered in the literature. Assume the larger operand 
has size $m$, and the smaller has size $n \le m$, and denote by $M(m,n)$ the
corresponding multiplication cost. 

When $m$ is an exact multiple of $n$, say $m = kn$, a trivial strategy is to
cut the larger operand into $k$ pieces, giving $M(kn, n) = kM(n) + O(kn)$.
However, this is not always the best strategy, see Ex. 1.14.

When $m$ is not an exact multiple of $n$, different strategies are possible.
Consider for example Karatsuba multiplication, and let $K(m,n)$ be the number
of word-products for an $m \times n$ product. Take for example $m = 5$, $n = 3$.
A natural idea is to pad the smallest operand to the size of the largest one.
However there are several ways to perform this padding, as shown in the
Figure, where the "Karatsuba cut" is represented by a double column:

The first strategy leads to two products of size 3, i.e., $2K(3,3)$, the second one
to $K(2,1) + K(3,2) + K(3,3)$, and the third one to $K(2,2) + K(3,1) + K(3,3)$,
which give respectively 14, 15, 13 word products.

However, whenever $m/2 \le n \le m$, any such "padding strategy" will
require $K(\lceil m/2 \rceil, \lceil m/2 \rceil)$ for the product of the differences (or sums) of the
low and high parts from the operands, due to a "wrap around" effect when
subtracting the parts from the smaller operand; this will ultimately lead to
a cost similar to that of an $m \times m$ product. The "odd-even scheme" (see
also Ex. 1.11) avoids this wrap around.

Algorithm 5 OddEvenKaratsuba

Input: $A = \sum_0^{m-1} a_i x^i$, $B = \sum_0^{n-1} b_j x^j$, $m \ge n$
Output: $A \cdot B$

If $n = 1$ then return $\sum_0^{m-1} a_i b_0 x^i$

Write $A = A_0(x^2) + xA_1(x^2)$, $B = B_0(x^2) + xB_1(x^2)$

$C_0 \leftarrow$ OddEvenKaratsuba$(A_0, B_0)$

$C_1 \leftarrow$ OddEvenKaratsuba$(A_0 + A_1, B_0 + B_1)$

$C_2 \leftarrow$ OddEvenKaratsuba$(A_1, B_1)$

return $C_0(x^2) + x(C_1 - C_0 - C_2)(x^2) + x^2C_2(x^2)$.

Here is an example of Algorithm OddEvenKaratsuba for $m = 3$ and $n = 2$.
Take $A = a_2x^2 + a_1x + a_0$ and $B = b_1x + b_0$. This yields $A_0 = a_2x + a_0$,
$A_1 = a_1$, $B_0 = b_0$, $B_1 = b_1$, thus $C_0 = (a_2x + a_0)b_0$,
$C_1 = (a_2x + a_0 + a_1)(b_0 + b_1)$, $C_2 = a_1b_1$. We thus
get $K(3,2) = 2K(2,1) + K(1) = 5$ with the odd-even scheme. The general
recurrence for the odd-even scheme is:

$$K(m,n) = 2K(\lceil m/2 \rceil, \lceil n/2 \rceil) + K(\lfloor m/2 \rfloor, \lfloor n/2 \rfloor),$$

instead of

$$K(m,n) = 2K(\lceil m/2 \rceil, \lceil m/2 \rceil) + K(\lfloor m/2 \rfloor, n - \lceil m/2 \rceil)$$

for the classical strategy, assuming $n > m/2$. The second parameter in $K(\cdot, \cdot)$
only depends on the smaller size $n$ for the odd-even scheme.

As for the classical strategy, there are several ways of padding with the
odd-even scheme. Consider $m = 5$, $n = 3$, and write $A := a_4x^4 + a_3x^3 + a_2x^2 +
a_1x + a_0 = xA_1(x^2) + A_0(x^2)$, with $A_1(x) = a_3x + a_1$, $A_0(x) = a_4x^2 + a_2x + a_0$;
and $B := b_2x^2 + b_1x + b_0 = xB_1(x^2) + B_0(x^2)$, with $B_1(x) = b_1$, $B_0(x) =
b_2x + b_0$. Without padding, we write $AB = x^2(A_1B_1)(x^2) + x((A_0 + A_1)(B_0 +
B_1) - A_1B_1 - A_0B_0)(x^2) + (A_0B_0)(x^2)$, which gives $K(5,3) = K(2,1) +
2K(3,2) = 12$. With padding, we consider $xB = xB'_1(x^2) + B'_0(x^2)$, with
$B'_1(x) = b_2x + b_0$, $B'_0 = b_1x$. This gives $K(2,2) = 3$ for $A_1B'_1$, $K(3,2) = 5$
for $(A_0 + A_1)(B'_0 + B'_1)$, and $K(3,1) = 3$ for $A_0B'_0$ (taking into account the
fact that $B'_0$ has only one non-zero coefficient), thus a total of 11 only.
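The odd-even scheme without padding can be sketched in Python on coefficient lists (lowest degree first). The function names, the explicit `counter` argument used to tally coefficient products, and the slicing-based split into even and odd parts are assumptions of this sketch.

```python
def poly_add(p, q):
    """Coefficient-wise sum of two lists of possibly different lengths."""
    if len(p) < len(q):
        p, q = q, p
    return [x + (q[i] if i < len(q) else 0) for i, x in enumerate(p)]


def odd_even_karatsuba(a, b, counter):
    """Product of coefficient lists a (m terms) and b (n terms), m >= n >= 1.

    counter[0] is incremented by the number of coefficient products used.
    """
    m, n = len(a), len(b)
    assert m >= n >= 1
    if n == 1:
        counter[0] += m
        return [ai * b[0] for ai in a]
    a0, a1 = a[0::2], a[1::2]          # A = A0(x^2) + x * A1(x^2)
    b0, b1 = b[0::2], b[1::2]
    c0 = odd_even_karatsuba(a0, b0, counter)
    c1 = odd_even_karatsuba(poly_add(a0, a1), poly_add(b0, b1), counter)
    c2 = odd_even_karatsuba(a1, b1, counter)
    res = [0] * (m + n + 1)            # slack for a top coefficient that is zero
    for i, c in enumerate(c0):
        res[2 * i] += c                # C0(x^2)
    for i, c in enumerate(c2):
        res[2 * i + 2] += c            # x^2 * C2(x^2)
    for i, c in enumerate(c1):         # x * (C1 - C0 - C2)(x^2)
        res[2 * i + 1] += c - c0[i] - (c2[i] if i < len(c2) else 0)
    return res[:m + n - 1]
```

Running it on operands of sizes $(3, 2)$ and $(5, 3)$ uses 5 and 12 coefficient products respectively, in agreement with the counts $K(3,2) = 5$ and $K(5,3) = 12$ obtained above.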

1.3.6 Squaring 

In many applications, a significant proportion of the multiplications have 
equal operands, i.e., are squarings. Hence it is worth tuning a special squar- 
ing implementation as much as the implementation of multiplication itself, 
bearing in mind that the best possible speedup is two (see Ex. 1.15).

For naive multiplication, Algorithm BasecaseMultiply (§1.3.1) can be
modified to obtain a theoretical speedup of two, since only about half of the
products $a_ib_j$ need to be computed.
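A minimal Python sketch of such a squaring routine follows. It computes each cross product $a_ia_j$ with $i < j$ once and doubles it, so it uses $n(n+1)/2$ word products instead of $n^2$. The name `basecase_square` is hypothetical, and the column sums rely on Python's unbounded integers; a word-level implementation would have to manage the carries of the doubling explicitly.

```python
def basecase_square(a, beta):
    """Square the n-word number given by digit list a (least significant first)."""
    n = len(a)
    cols = [0] * (2 * n)
    for i in range(n):
        cols[2 * i] += a[i] * a[i]             # diagonal term, computed once
        for j in range(i + 1, n):
            cols[i + j] += 2 * a[i] * a[j]     # each cross product doubled
    c, carry = [], 0
    for col in cols:                           # final carry propagation
        carry, digit = divmod(col + carry, beta)
        c.append(digit)
    return c
```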

Subquadratic algorithms like Karatsuba and Toom-Cook r-way can be 
specialized for squaring too. In general, the threshold obtained is larger than 
the corresponding multiplication threshold. 

1.3.7 Multiplication by a Constant 

It often happens that one integer is used in several consecutive multiplications,
or is fixed for a complete calculation. If this constant multiplier is
small, i.e., less than the base $\beta$, not much speedup can be obtained compared
to the usual product. We thus consider here a "large" constant multiplier.

When using evaluation-interpolation algorithms, like Karatsuba or Toom-Cook
(see §1.3.2-§1.3.3), one may store the results of the evaluation for that
fixed multiplicand. If one assumes that an interpolation is as expensive as
one evaluation, this may give a speedup of up to 3/2.

Special-purpose algorithms also exist. These algorithms differ from classical
multiplication algorithms because they take into account the value of
the given constant multiplier, and not only its size in bits or digits. They
also differ in the model of complexity used. For example, Bernstein's algorithm
[21], which is used by several compilers to compute addresses in data
structure records, considers as basic operation $x, y \mapsto 2^i x \pm y$, with a cost
assumed to be independent of the integer $i$.

For example Bernstein's algorithm computes $20061x$ in five steps:

$x_1 := 31x = 2^5 x - x$

$x_2 := 93x = 2^1 x_1 + x_1$

$x_3 := 743x = 2^3 x_2 - x$

$x_4 := 6687x = 2^3 x_3 + x_3$

$20061x = 2^1 x_4 + x_4$.

See [116] for a comparison of different algorithms for the problem of
multiplication by an integer constant.
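The five-step chain for $20061x$ translates directly into shifts and additions. The Python sketch below only replays that specific chain; the function name `times_20061` is hypothetical, and finding such a chain for an arbitrary constant is a separate problem that is not addressed here.

```python
def times_20061(x):
    """Compute 20061 * x with five operations of the form 2^i * u +/- v."""
    x1 = (x << 5) - x        # 31x
    x2 = (x1 << 1) + x1      # 93x
    x3 = (x2 << 3) - x       # 743x
    x4 = (x3 << 3) + x3      # 6687x
    return (x4 << 1) + x4    # 20061x
```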



1.4 Division 

Division is the next operation to consider after multiplication. Optimizing 
division is almost as important as optimizing multiplication, since division is 
usually more expensive, thus the speedup obtained on division will be more 
significant. On the other hand, one usually performs more multiplications 
than divisions. 

One strategy is to avoid divisions when possible, or replace them by 
multiplications. An example is when the same divisor is used for several 
consecutive operations; one can then precompute its inverse (see §2.3.1).

We distinguish several kinds of division: full division computes both quo- 
tient and remainder, while in some cases only the quotient (for example when 
dividing two floating-point mantissas) or remainder (when multiplying two 
residues modulo $n$) is needed. Then we discuss exact division, when the
remainder is known to be zero, and the problem of dividing by a constant.

1.4.1 Naive Division 

In all division algorithms, we will consider normalized divisors. We say that
$B := \sum_0^{n-1} b_j\beta^j$ is normalized when its most significant word $b_{n-1}$ satisfies
$b_{n-1} \ge \beta/2$. This is a stricter condition (for $\beta > 2$) than simply requiring
that $b_{n-1}$ be nonzero.

Algorithm 6 BasecaseDivRem

Input: $A = \sum_0^{n+m-1} a_i\beta^i$, $B = \sum_0^{n-1} b_j\beta^j$, $B$ normalized
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$.

1: if $A \ge \beta^m B$ then $q_m \leftarrow 1$, $A \leftarrow A - \beta^m B$ else $q_m \leftarrow 0$

2: for $j$ from $m - 1$ downto $0$ do

3:     $q_j^* \leftarrow \lfloor (a_{n+j}\beta + a_{n+j-1})/b_{n-1} \rfloor$    ▷ quotient selection step

4:     $q_j \leftarrow \min(q_j^*, \beta - 1)$

5:     $A \leftarrow A - q_j\beta^j B$

6:     while $A < 0$ do

7:         $q_j \leftarrow q_j - 1$

8:         $A \leftarrow A + \beta^j B$

9: return $Q = \sum_0^m q_j\beta^j$, $R = A$.
(Note: in the above algorithm, $a_i$ denotes the current value of the $i$-th word
of $A$, after the possible changes at steps 5 and 8.)
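The steps of BasecaseDivRem can be traced with the following Python sketch. It is an illustration under stated assumptions: $A$ and $B$ are held as Python integers, the two leading words of the current $A$ are extracted by an integer division, and the name `basecase_divrem` and its argument order are hypothetical.

```python
def basecase_divrem(A, B, n, m, beta):
    """Divide 0 <= A < beta**(n+m) by a normalized n-word B; return (Q, R)."""
    top = B // beta ** (n - 1)                 # most significant word b_{n-1}
    assert beta ** (n - 1) <= B < beta ** n and 2 * top >= beta
    assert 0 <= A < beta ** (n + m)
    q = [0] * (m + 1)
    if A >= beta ** m * B:                     # step 1
        q[m] = 1
        A -= beta ** m * B
    for j in range(m - 1, -1, -1):
        hi = A // beta ** (n + j - 1)          # a_{n+j} * beta + a_{n+j-1}
        qj = min(hi // top, beta - 1)          # steps 3 and 4
        A -= qj * beta ** j * B                # step 5
        while A < 0:                           # steps 6 to 8: correction loop
            qj -= 1
            A += beta ** j * B
        q[j] = qj
    Q = sum(qj * beta ** j for j, qj in enumerate(q))
    return Q, A
```

For example, with $\beta = 1000$, $n = m = 3$, $A = 766970544842443844$ and $B = 862664913$, the selected quotient words are 889, 71 and 218, the last of which is corrected once to 217, and the call returns $(889071217, 778334723)$.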

If $B$ is not normalized, we can compute $A' = 2^kA$ and $B' = 2^kB$ so that
$B'$ is normalized, then divide $A'$ by $B'$ giving $A' = Q'B' + R'$; the quotient and
remainder of the division of $A$ by $B$ are respectively $Q := Q'$ and $R := R'/2^k$,
the latter division being exact.

Theorem 1.4.1 Algorithm BasecaseDivRem correctly computes the quotient
and remainder of the division of A by a normalized B, in O(nm) word
operations.

Proof. We first prove that the invariant $A < \beta^{j+1} B$ holds at step 2. This holds
trivially for j = m − 1: B being normalized, $A < 2\beta^m B$ initially.

First consider the case $q_j = q_j^*$: then $q_j b_{n-1} \ge a_{n+j}\beta + a_{n+j-1} - b_{n-1} + 1$,
thus

$$A - q_j \beta^j B \le (b_{n-1} - 1)\beta^{n+j-1} + (A \bmod \beta^{n+j-1}),$$

which ensures that the new $a_{n+j}$ vanishes, and $a_{n+j-1} < b_{n-1}$, thus $A < \beta^j B$
after step 5. Now A may become negative after step 5, but since
$q_j b_{n-1} \le a_{n+j}\beta + a_{n+j-1}$:

$$A - q_j \beta^j B > (a_{n+j}\beta + a_{n+j-1})\beta^{n+j-1} - q_j (b_{n-1}\beta^{n-1} + \beta^{n-1})\beta^j \ge -q_j \beta^{n+j-1}.$$

Therefore $A - q_j \beta^j B + 2\beta^j B \ge (2b_{n-1} - q_j)\beta^{n+j-1} > 0$, which proves that the
while-loop at steps 6-8 is performed at most twice [110, Theorem 4.3.1.B].
When the while-loop is entered, A may increase only by $\beta^j B$ at a time, hence
$A < \beta^j B$ at exit.

In the case $q_j \ne q_j^*$, i.e., $q_j^* \ge \beta$, we have before the while-loop:
$A < \beta^{j+1} B - (\beta - 1)\beta^j B = \beta^j B$, thus the invariant holds. If the while-loop is
entered, the same reasoning as above holds.

We conclude that when the for-loop ends, $0 \le A < B$ holds, and since
$(\sum_{i=j}^{m} q_i \beta^i) B + A$ is invariant through the algorithm, the quotient Q and
remainder R are correct.

The most expensive step is step 5, which costs O(n) operations for $q_j B$
(the multiplication by $\beta^j$ is simply a word-shift); the total cost is O(nm). □

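Here is an example of algorithm BasecaseDivRem for the inputs
A = 766970544842443844 and B = 862664913, with β = 1000:

j   A                          q_j   A − q_j B β^j          after correction

2   766 970 544 842 443 844    889   61 437 185 443 844     no change
1   61 437 185 443 844         71    187 976 620 844        no change
0   187 976 620 844            218   −84 330 190            778 334 723

which gives quotient Q = 889071217 and remainder R = 778334723.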

Algorithm BasecaseDivRem simplifies when $A < \beta^m B$: remove step 1
and change m into m − 1 in the return value Q. However, the more general
form we give is more convenient for a computer implementation, and will be
used below.
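
The steps of BasecaseDivRem can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the operands are ordinary Python integers rather than word arrays, the word counts n and m are recomputed from the values, and the name basecase_divrem is ours, not the book's.

```python
def basecase_divrem(A, B, beta):
    """Schoolbook division of A by a normalized B, one base-beta word at a time."""
    # n = number of base-beta words of B.
    n = 1
    while beta**n <= B:
        n += 1
    b_top = B // beta**(n - 1)            # most significant word b_{n-1}
    assert 2 * b_top >= beta, "B must be normalized"
    # m = smallest value such that A has at most n + m words.
    m = 0
    while beta**(n + m) <= A:
        m += 1
    q = [0] * (m + 1)
    # Step 1: the leading quotient word is 0 or 1.
    if A >= beta**m * B:
        q[m] = 1
        A -= beta**m * B
    for j in range(m - 1, -1, -1):
        # Step 3: a_{n+j}*beta + a_{n+j-1}, the two top words of the current A.
        top2 = A // beta**(n + j - 1)
        # Step 4: clamp the estimate to a single word.
        qj = min(top2 // b_top, beta - 1)
        # Step 5: subtract the shifted multiple of B.
        A -= qj * beta**j * B
        # Steps 6-8: the estimate is never too small, so only decrease it.
        while A < 0:
            qj -= 1
            A += beta**j * B
        q[j] = qj
    Q = sum(qj * beta**j for j, qj in enumerate(q))
    return Q, A


print(basecase_divrem(766970544842443844, 862664913, 1000))
```

With the inputs of the example above, the call reproduces the quotient and remainder stated in the text.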

A possible variant when $q_j^* \ge \beta$ is to let $q_j = \beta$; then $A - q_j \beta^j B$ at step
5 reduces to a single subtraction of B shifted by j + 1 words. However in
this case the while-loop will be performed at least once, which corresponds
to the identity $A - (\beta - 1)\beta^j B = A - \beta^{j+1} B + \beta^j B$.

If instead of having B normalized, i.e., $b_{n-1} \ge \beta/2$, one has $b_{n-1} \ge \beta/k$, there
can be up to k iterations of the while-loop (and step 1 has to be modified).

A drawback of algorithm BasecaseDivRem is that the test A < 0 at
line 6 is true with non-negligible probability, therefore branch prediction
algorithms available on modern processors will fail, resulting in wasted cycles.
A workaround is to compute a more accurate partial quotient, and therefore
decrease the proportion of corrections to almost zero (see Ex. 1.18).



1.4.2 Divisor Preconditioning 

Sometimes the quotient selection (step 3 of Algorithm BasecaseDivRem)
is quite expensive compared to the total cost, especially for small sizes.
Indeed, some processors do not have a machine instruction for the division
of two words by one word; one way to compute $q_j^*$ is then to precompute a
one-word approximation of the inverse of $b_{n-1}$, and to multiply it by
$a_{n+j}\beta + a_{n+j-1}$.

Svoboda's algorithm [161] makes the quotient selection trivial, after
preconditioning the divisor. The main idea is that if $b_{n-1}$ equals the base β in
Algorithm BasecaseDivRem, then the quotient selection is easy, since it
suffices to take $q_j^* = a_{n+j}$. (In addition, $q_j^* \le \beta - 1$ is then always fulfilled,
thus step 4 of BasecaseDivRem can be avoided, and $q_j^*$ replaced by $q_j$.)

Algorithm 7 SvobodaDivision

Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$ normalized, $A < \beta^m B$
Output: quotient Q and remainder R of A divided by B.

1: $k \leftarrow \lceil \beta^{n+1}/B \rceil$
2: $B' \leftarrow kB = \beta^{n+1} + \sum_{j=0}^{n-1} b'_j \beta^j$
3: for j from m − 1 downto 1 do
4:     $q_j \leftarrow a_{n+j}$
5:     $A \leftarrow A - q_j \beta^{j-1} B'$
6:     if $A < 0$ then
7:         $q_j \leftarrow q_j - 1$
8:         $A \leftarrow A + \beta^{j-1} B'$
9: $Q' = \sum_{j=1}^{m-1} q_j \beta^{j-1}$, $R' = A$
10: $(q_0, R) \leftarrow (R' \text{ div } B, R' \bmod B)$
11: return $Q = kQ' + q_0$, R.



The division at step 10 can be performed with BasecaseDivRem; it gives
a single word since A has n + 1 words.

With the example of §1.4.1, Svoboda's algorithm would give k = 1160,
B' = 1000691299080:

j   A                          q_j   A − q_j B' β^{j−1}      after correction

2   766 970 544 842 443 844    766   441 009 747 163 844     no change
1   441 009 747 163 844        441   −295 115 730 436        705 575 568 644

We thus get Q' = 766440 and R' = 705575568644. The final division of
step 10 gives R' = 817B + 778334723, thus we get Q = 1160 · 766440 + 817 =
889071217, and R = 778334723, as in §1.4.1.

Svoboda's algorithm is especially interesting when only the remainder is
needed, since then one can avoid the "deconditioning" $Q = kQ' + q_0$. Note
that when only the quotient is needed, dividing A' = kA by B' = kB yields
it.
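
A minimal Python sketch of Svoboda's preconditioning is given below. It works on Python integers instead of word arrays, the name svoboda_divrem is hypothetical, and, as a defensive deviation from Algorithm SvobodaDivision, the single correction test is written as a while-loop so that the result stays correct even if an estimate were off by more than one.

```python
def svoboda_divrem(A, B, beta):
    """Divide A by B after preconditioning B so that quotient selection is trivial."""
    # n = number of base-beta words of B.
    n = 1
    while beta**n <= B:
        n += 1
    # m = smallest value with A < beta^m * B.
    m = 0
    while A >= beta**m * B:
        m += 1
    k = -(-beta**(n + 1) // B)        # ceil(beta^(n+1) / B)
    Bp = k * B                        # B' = beta^(n+1) + (a part smaller than B)
    Qp = 0
    for j in range(m - 1, 0, -1):
        qj = A // beta**(n + j)       # trivial selection: the word a_{n+j}
        A -= qj * beta**(j - 1) * Bp
        while A < 0:                  # the algorithm in the text uses a single "if"
            qj -= 1
            A += beta**(j - 1) * Bp
        Qp = Qp * beta + qj           # accumulates Q' = sum of q_j * beta^(j-1)
    q0, R = divmod(A, B)              # final small division of R' by B
    return k * Qp + q0, R             # "deconditioning": Q = k*Q' + q0


print(svoboda_divrem(766970544842443844, 862664913, 1000))
```

On the running example this goes through k = 1160, Q' = 766440 and R' = 705575568644 before the final division, as in the table above.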

1.4.3 Divide and Conquer Division 

The base-case division from §1.4.1 determines the quotient word by word. A
natural idea is to try getting several words at a time, for example replacing
the quotient selection step in Algorithm BasecaseDivRem by:

$$q_j^* \leftarrow \left\lfloor \frac{a_{n+j}\beta^3 + a_{n+j-1}\beta^2 + a_{n+j-2}\beta + a_{n+j-3}}{b_{n-1}\beta + b_{n-2}} \right\rfloor.$$

Since $q_j^*$ has then two words, fast multiplication algorithms (§1.3) might
speed up the computation of $q_j B$ at step 5 of Algorithm BasecaseDivRem.
More generally, the most significant half of the quotient (say $Q_1$, of k
words) mainly depends on the k most significant words of the dividend
and divisor. Once a good approximation to $Q_1$ is known, fast multiplication
algorithms can be used to compute the partial remainder $A - Q_1 B$. The
second idea of the divide and conquer division algorithm below is to compute
the corresponding remainder together with the partial quotient $Q_1$; in such
a way, one only has to subtract the product of $Q_1$ by the low part of the
divisor, before computing the low part of the quotient.

Algorithm 8 RecursiveDivRem

Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$, B normalized, $n \ge m$
Output: quotient Q and remainder R of A divided by B.

1: if m < 2 then return BasecaseDivRem(A, B)
2: $k \leftarrow \lfloor m/2 \rfloor$, $B_1 \leftarrow B \text{ div } \beta^k$, $B_0 \leftarrow B \bmod \beta^k$
3: $(Q_1, R_1) \leftarrow$ RecursiveDivRem($A \text{ div } \beta^{2k}$, $B_1$)
4: $A' \leftarrow R_1 \beta^{2k} + (A \bmod \beta^{2k}) - Q_1 B_0 \beta^k$
5: while $A' < 0$ do $Q_1 \leftarrow Q_1 - 1$, $A' \leftarrow A' + \beta^k B$
6: $(Q_0, R_0) \leftarrow$ RecursiveDivRem($A' \text{ div } \beta^k$, $B_1$)
7: $A'' \leftarrow R_0 \beta^k + (A' \bmod \beta^k) - Q_0 B_0$
8: while $A'' < 0$ do $Q_0 \leftarrow Q_0 - 1$, $A'' \leftarrow A'' + B$
9: return $Q := Q_1 \beta^k + Q_0$, $R := A''$.

In Algorithm RecursiveDivRem, one may replace the condition m < 2
at step 1 by m < T for any integer T ≥ 2. In practice, T is usually in the
range 50 to 200.

One cannot require $A < \beta^m B$ here, since this condition may not be
satisfied in the recursive calls. Consider for example A = 5517, B = 56
with β = 10: the first recursive call will divide 55 by 5, which yields a
two-digit quotient 11. Even $A \le \beta^m B$ is not recursively fulfilled; consider
A = 55170000 with B = 5517: the first recursive call will divide 5517 by 55.
The weakest possible condition is that the n most significant words of A do
not exceed those of B, i.e., $A < \beta^m (B + 1)$. In that case, the quotient is
bounded by $\beta^m + \lfloor (\beta^m - 1)/B \rfloor$, which yields $\beta^m + 1$ in the case n = m
(see also the exercises).

Theorem 1.4.2 Algorithm RecursiveDivRem is correct, and uses D(n +
m, n) operations, where D(n + m, n) = 2D(n, n − m/2) + 2M(m/2) + O(n).
In particular D(n) := D(2n, n) satisfies D(n) = 2D(n/2) + 2M(n/2) + O(n),
which gives $D(n) \sim \frac{1}{2^{\alpha-1} - 1} M(n)$ for $M(n) \sim n^\alpha$, α > 1.

Proof. We first check the assumption for the recursive calls: $B_1$ is
normalized since it has the same most significant word as B.

After step 3, we have $A = (Q_1 B_1 + R_1)\beta^{2k} + (A \bmod \beta^{2k})$, thus after step
4: $A' = A - Q_1 \beta^k B$, which still holds after step 5. After step 6, we have
$A' = (Q_0 B_1 + R_0)\beta^k + (A' \bmod \beta^k)$, thus after step 7: $A'' = A' - Q_0 B$, which
still holds after step 8. At step 9 we thus have A = QB + R.

$A \text{ div } \beta^{2k}$ has m + n − 2k words, while $B_1$ has n − k words, thus
$0 \le Q_1 < 2\beta^{m-k}$ and $0 \le R_1 < B_1 < \beta^{n-k}$. Thus at step 4, $-2\beta^{m+k} < A' < \beta^k B$.
Since B is normalized, the while-loop at step 5 is performed at most four
times (this can happen only when n = m). At step 6 we have $0 \le A' < \beta^k B$,
thus $A' \text{ div } \beta^k$ has at most n words.

It follows $0 \le Q_0 < 2\beta^k$ and $0 \le R_0 < B_1 < \beta^{n-k}$. Hence at step
7, $-2\beta^{2k} < A'' < B$, and after at most four iterations at step 8, we have
$0 \le A'' < B$. □



Theorem 1.4.2 gives D(n) ∼ 2M(n) for Karatsuba multiplication, and
D(n) ∼ 2.63...M(n) for Toom-Cook 3-way; for the FFT range, see the exercises.

The same idea as in Ex. 1.18 applies: to decrease the probability that
the estimated quotients $Q_1$ and $Q_0$ are too large, use one extra word of the
truncated dividend and divisors in the recursive calls to RecursiveDivRem.

A graphical view of Algorithm RecursiveDivRem in the case m = n
is given in Fig. 1.1, which represents the multiplication Q · B: one first
computes the lower left corner in D(n/2) (step 3), second the lower right
corner in M(n/2) (step 4), third the upper left corner in D(n/2) (step 6),
and finally the upper right corner in M(n/2) (step 7).

Figure 1.1: Divide and conquer division: a graphical view (most significant
parts at the lower left corner).
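
The recursion of Algorithm RecursiveDivRem can be sketched in Python as follows. The sketch assumes Python integers as operands, takes the word counts n and m as explicit arguments, and uses the built-in divmod as a stand-in for BasecaseDivRem below the threshold T; the function name and the parameter T follow the text, everything else is illustrative.

```python
def recursive_divrem(A, B, n, m, beta, T=2):
    """Divide an (n+m)-word A by an n-word normalized B, with n >= m."""
    if m < T:
        return divmod(A, B)                    # stand-in for BasecaseDivRem
    k = m // 2
    bk = beta**k
    B1, B0 = divmod(B, bk)                     # high and low parts of the divisor
    # Step 3: high part of the quotient, from the high parts of A and B.
    Q1, R1 = recursive_divrem(A // bk**2, B1, n - k, m - k, beta, T)
    # Step 4: subtract the product of Q1 by the low part of the divisor.
    A1 = R1 * bk**2 + A % bk**2 - Q1 * B0 * bk
    while A1 < 0:                              # step 5: correct Q1 if too large
        Q1 -= 1
        A1 += bk * B
    # Step 6: low part of the quotient.
    Q0, R0 = recursive_divrem(A1 // bk, B1, n - k, k, beta, T)
    A2 = R0 * bk + A1 % bk - Q0 * B0           # step 7
    while A2 < 0:                              # step 8: correct Q0 if too large
        Q0 -= 1
        A2 += B
    return Q1 * bk + Q0, A2


print(recursive_divrem(766970544842443844, 862664913, 3, 3, 1000))
```

Because each partial quotient is only ever decreased until the partial remainder is non-negative, the sketch returns the same pair as divmod; the choice of T only affects where the recursion stops.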



Unbalanced Division 

The condition n ≥ m in Algorithm RecursiveDivRem means that the
dividend A is at most twice as large as the divisor B.

When A is more than twice as large as B (m > n with the above
notations), a possible strategy (see Ex. 1.22) computes n words of the quotient
at a time. This reduces to the base-case algorithm, replacing β by $\beta^n$.

Algorithm 9 UnbalancedDivision

Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$, B normalized, m > n.
Output: quotient Q and remainder R of A divided by B.

$Q \leftarrow 0$
while m > n do
    $(q, r) \leftarrow$ RecursiveDivRem($A \text{ div } \beta^{m-n}$, B)    (2n by n division)
    $Q \leftarrow Q\beta^n + q$
    $A \leftarrow r\beta^{m-n} + A \bmod \beta^{m-n}$
    $m \leftarrow m - n$
$(q, r) \leftarrow$ RecursiveDivRem(A, B)
return $Q := Q\beta^m + q$, $R := r$.



1.4.4 Newton's Method 

Newton's iteration gives the division algorithm with best asymptotic
complexity. One basic component of Newton's iteration is the computation of
an approximate inverse. We refer here to Chapter 4. The p-adic version of
Newton's method, also called Hensel lifting, is used in §1.4.5 for the exact
division.



1.4.5 Exact Division 

A division is exact when the remainder is zero. This happens for example 
when normalizing a fraction a/b: one divides both a and b by their greatest 
common divisor, and both divisions are exact. If the remainder is known a 
priori to be zero, this information is useful to speed up the computation of 
the quotient. Two strategies are possible: 

• use classical MSB (most significant bits first) division algorithms, with- 
out computing the lower part of the remainder. Here, one has to take 
care of rounding errors, in order to guarantee the correctness of the 
final result; or 




• use LSB (least significant bits first) algorithms. If the quotient is known
to be less than $\beta^n$, computing $a/b \bmod \beta^n$ will reveal it.

In both strategies, subquadratic algorithms can be used too. We describe 
here the least significant bit algorithm, using Hensel lifting — which can be 
seen as a p-adic version of Newton's method: 

Algorithm 10 ExactDivision

Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$
Output: quotient $Q = A/B \bmod \beta^n$

1: $C \leftarrow 1/b_0 \bmod \beta$
2: for i from $\lceil \log_2 n \rceil - 1$ downto 1 do
3:     $k \leftarrow \lceil n/2^i \rceil$
4:     $C \leftarrow C + C(1 - BC) \bmod \beta^k$
5: $Q \leftarrow AC \bmod \beta^k$
6: $Q \leftarrow Q + C(A - BQ) \bmod \beta^n$

Algorithm ExactDivision uses the Karp-Markstein trick: lines 1-4
compute $1/B \bmod \beta^{\lceil n/2 \rceil}$, while the two last lines incorporate the dividend to
obtain $A/B \bmod \beta^n$. Note that the middle product (§3.3.2) can be used in
lines 4 and 6 to speed up the computation of 1 − BC and A − BQ respectively.

Finally, another gain is obtained using both strategies simultaneously: 
compute the most significant n/2 bits of the quotient using the MSB strategy, 
and the least n/2 bits using the LSB one. Since an exact division of size n is 
replaced by two exact divisions of size n/2, this gives a speedup up to 2 for 
quadratic algorithms (see Ex. 1.25).
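
The LSB strategy of Algorithm ExactDivision can be sketched in Python as follows. The sketch assumes B is relatively prime to β, uses Python integers instead of word arrays, and relies on pow(x, -1, m) for the one-word inverse (available from Python 3.8). The value k is recomputed as ⌈n/2⌉ after the loop so that the cases n = 1 and n = 2, where the loop body never runs, are also covered; that detail is our choice, not part of the algorithm as stated.

```python
def exact_division(A, B, n, beta):
    """Return A/B mod beta^n by Hensel lifting, assuming gcd(B, beta) = 1."""
    C = pow(B % beta, -1, beta)              # line 1: 1/b_0 mod beta
    steps = (n - 1).bit_length()             # ceil(log2(n)) for n >= 1
    for i in range(steps - 1, 0, -1):
        k = -(-n // 2**i)                    # line 3: ceil(n / 2^i)
        C = (C + C * (1 - B * C)) % beta**k  # line 4: Newton step, doubles precision
    k = -(-n // 2)                           # C is now 1/B mod beta^ceil(n/2)
    Q = A * C % beta**k                      # line 5: low half of the quotient
    return (Q + C * (A - B * Q)) % beta**n   # line 6: Karp-Markstein last step


B, Q = 862664913, 889071217
print(exact_division(B * Q, B, 3, 1000) == Q)
```

When the true quotient is known to be less than β^n, the value returned is the quotient itself.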

1.4.6 Only Quotient or Remainder Wanted 

When both the quotient and remainder of a division are needed, it is best 
to compute them simultaneously. This may seem to be a trivial statement, 
nevertheless some high-level languages provide both div and mod, but no 
single instruction to compute both quotient and remainder. 

Once the quotient is known, the remainder can be recovered by a single 
multiplication as a — qb; on the other hand, when the remainder is known, 
the quotient can be recovered by an exact division as (a − r)/b (§1.4.5).

However, it often happens that only one of the quotient or remainder is
needed. For example, the division of two floating-point numbers reduces to
the quotient of their fractions (see Chapter 3). Conversely, the
multiplication of two numbers modulo n reduces to the remainder of their product
after division by n (see Chapter 2). In such cases, one may wonder if faster
algorithms exist.

For a dividend of 2n words and a divisor of n words, a significant speedup
(up to two for quadratic algorithms) can be obtained when only the
quotient is needed, since one does not need to update the low n words of the
current remainder (step 5 of Algorithm BasecaseDivRem).

It seems difficult to get a similar speedup when only the remainder is
required. One possibility is to use Svoboda's algorithm, but this requires
some precomputation, so is only useful when several divisions are performed
with the same divisor. The idea is the following: precompute a multiple
$B_1$ of B, having 3n/2 words, the n/2 most significant words being $\beta^{n/2}$.
Then reducing A mod $B_1$ reduces to a single n/2 × n multiplication. Once
A is reduced into $A_1$ of 3n/2 words by Svoboda's algorithm in 2M(n/2),
use RecursiveDivRem on $A_1$ and B, which costs D(n/2) + M(n/2). The
total cost is thus 3M(n/2) + D(n/2), instead of 2M(n/2) + 2D(n/2) for a
full division with RecursiveDivRem. This gives (5/3)M(n) for Karatsuba and
2.04M(n) for Toom-Cook 3-way. A similar algorithm is described in §2.3.2
(Subquadratic Montgomery Reduction) with further optimizations.

1.4.7 Division by a Constant 

As for multiplication, division by a constant c is an important special case.
It arises for example in Toom-Cook multiplication, where one has to perform
an exact division by 3 (§1.3.3). We assume here that we want to divide a
multiprecision number by a one-word constant. One could of course use a
classical division algorithm (§1.4.1). Algorithm ConstantDivide performs
a modular division:

$$A + b\beta^n = cQ,$$

where the "carry" b will be zero when the division is exact.

Algorithm 11 ConstantDivide

Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$, $0 < c < \beta$.
Output: $Q = \sum_{i=0}^{n-1} q_i \beta^i$ and $0 \le b < c$ such that $A + b\beta^n = cQ$

1: $d \leftarrow 1/c \bmod \beta$
2: $b \leftarrow 0$
3: for i from 0 to n − 1 do
4:     if $b \le a_i$ then $(x, b') \leftarrow (a_i - b, 0)$
5:     else $(x, b') \leftarrow (a_i - b + \beta, 1)$
6:     $q_i \leftarrow dx \bmod \beta$
7:     $b'' \leftarrow (q_i c - x)/\beta$
8:     $b \leftarrow b' + b''$
9: return $\sum_{i=0}^{n-1} q_i \beta^i$, b.

Theorem 1.4.3 The output of Algorithm ConstantDivide satisfies
$A + b\beta^n = cQ$.

Proof. We show that after step i, 0 ≤ i < n, we have $A_i + b\beta^{i+1} = cQ_i$, where
$A_i := \sum_{j=0}^{i} a_j \beta^j$ and $Q_i := \sum_{j=0}^{i} q_j \beta^j$. For i = 0, this is $a_0 + b\beta = cq_0$, which
is exactly line 7: since $q_0 = a_0/c \bmod \beta$, $q_0 c - a_0$ is divisible by β. Assume
now that $A_{i-1} + b\beta^i = cQ_{i-1}$ holds for 1 ≤ i < n. We have $a_i - b + b'\beta = x$,
then $x + b''\beta = cq_i$, thus

$$A_i + (b' + b'')\beta^{i+1} = A_{i-1} + \beta^i (a_i + b'\beta + b''\beta) = cQ_{i-1} - b\beta^i + \beta^i (x + b - b'\beta + b'\beta + b''\beta) = cQ_{i-1} + \beta^i (x + b''\beta) = cQ_i.$$ □

Remark: at line 7, since 0 ≤ x < β, b'' can also be obtained as $\lfloor q_i c/\beta \rfloor$.

Algorithm ConstantDivide is just a special case of Hensel's division, 
which is the topic of the next section. 
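
As an illustration, here is a small Python sketch of Algorithm ConstantDivide operating on a list of words (least significant first). The helper names to_words and from_words are hypothetical conveniences for the example, and c is assumed to be relatively prime to β so that 1/c mod β exists.

```python
def to_words(A, beta, n):
    """Split A into n base-beta words, least significant first."""
    return [(A // beta**i) % beta for i in range(n)]


def from_words(words, beta):
    """Inverse of to_words."""
    return sum(w * beta**i for i, w in enumerate(words))


def constant_divide(a_words, c, beta):
    """Return (q_words, b) with A + b*beta^n = c*Q, where n = len(a_words)."""
    d = pow(c, -1, beta)               # 1/c mod beta (Python 3.8+)
    b = 0
    q_words = []
    for ai in a_words:
        if b <= ai:
            x, b1 = ai - b, 0          # no borrow
        else:
            x, b1 = ai - b + beta, 1   # borrow one from the next word
        qi = d * x % beta              # next quotient word
        b2 = (qi * c - x) // beta      # exact: qi*c = x mod beta
        b = b1 + b2
        q_words.append(qi)
    return q_words, b


beta, n = 2**16, 7
A = 3 * 123456789012345678901234567890
q_words, b = constant_divide(to_words(A, beta, n), 3, beta)
print(from_words(q_words, beta) == A // 3, b)
```

When A is a multiple of c the final carry b is zero and the words of Q are those of A/c; otherwise Q is the modular quotient and b records the deficit.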

1.4.8 Hensel's Division 

Classical division consists in cancelling the most significant part of the
dividend by a multiple of the divisor, while Hensel's division cancels the least
significant part (Fig. 1.2). Given a dividend A of 2n words and a divisor B
of n words, the classical or MSB (most significant bit) division computes a
quotient Q and a remainder R such that A = QB + R, while Hensel's or
LSB (least significant bit) division computes a LSB-quotient Q' and a
LSB-remainder R' such that $A = Q'B + R'\beta^n$. While the MSB division requires
the most significant bit of B to be set, the LSB division requires B to be
relatively prime to the word base β, i.e., B to be odd for β a power of two.

The LSB-quotient is uniquely defined by $Q' = A/B \bmod \beta^n$, with
$0 \le Q' < \beta^n$. This in turn uniquely defines the LSB-remainder
$R' = (A - Q'B)\beta^{-n}$, with $-B < R' < \beta^n$.

Most MSB-division variants (naive, with preconditioning, divide and
conquer, Newton's iteration) have their LSB-counterpart. For example the
LSB preconditioning consists in using a multiple of the divisor such that
kB = 1 mod β, and Newton's iteration is called Hensel lifting in the LSB
case. The exact division algorithm described at the end of §1.4.5 uses both
MSB- and LSB-division simultaneously. One important difference is that
LSB-division does not need any correction step, since the carries go in the
direction opposite to the cancelled bits.

Figure 1.2: Classical/MSB division (left) vs Hensel/LSB division (right).

When only the remainder is wanted, Hensel's division is usually known
as Montgomery reduction (see §2.3.2).
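
A word-by-word version of Hensel's division can be sketched as follows, as the LSB counterpart of the naive MSB division. The function name hensel_divrem is hypothetical, the operands are Python integers, and B is assumed to be relatively prime to β.

```python
def hensel_divrem(A, B, n, beta):
    """Return (Q', R') with A = Q'*B + R'*beta^n, assuming gcd(B, beta) = 1."""
    inv = pow(B, -1, beta)               # 1/b_0 mod beta (Python 3.8+)
    Q = 0
    for i in range(n):
        qi = (A % beta) * inv % beta     # word that cancels the low word of A
        A = (A - qi * B) // beta         # exact division: the low word is now zero
        Q += qi * beta**i
    return Q, A                          # no correction step is ever needed


A, B, n, beta = 766970544842443844, 862664913, 3, 1000
Qp, Rp = hensel_divrem(A, B, n, beta)
print(A == Qp * B + Rp * beta**n)
```

Each iteration cancels one low word, so the quotient words never have to be adjusted afterwards, in contrast with the while-loop of BasecaseDivRem.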

1.5 Roots 

1.5.1 Square Root 

The "paper and pencil" method once taught at school to extract square roots
is very similar to "paper and pencil" division. It decomposes an integer m
of the form $s^2 + r$, taking two digits of m at a time, and finding one digit of
s for each two digits of m. It is based on the following idea: if $m = s^2 + r$
is the current decomposition, when taking two more digits of the root-end
(the input of the root operation, like the dividend for division),
we have a decomposition of the form $100m + r' = 100s^2 + 100r + r'$ with
$0 \le r' < 100$. Since $(10s + t)^2 = 100s^2 + 20st + t^2$, a good approximation to
the next digit t can be found by dividing 10r by 2s.

Algorithm SqrtRem generalizes this idea to a power $\beta^\ell$ of the internal
base close to $m^{1/4}$: one obtains a divide and conquer algorithm, which is in
fact an error-free variant of Newton's method (cf. Chapter 4):

Algorithm 12 SqrtRem

Input: $m = a_{n-1}\beta^{n-1} + \cdots + a_1\beta + a_0$ with $a_{n-1} \ne 0$
Output: (s, r) such that $s^2 \le m = s^2 + r < (s + 1)^2$

$\ell \leftarrow \lfloor (n-1)/4 \rfloor$
if $\ell = 0$ then return BasecaseSqrtRem(m)
write $m = a_3\beta^{3\ell} + a_2\beta^{2\ell} + a_1\beta^{\ell} + a_0$ with $0 \le a_2, a_1, a_0 < \beta^\ell$
$(s', r') \leftarrow$ SqrtRem($a_3\beta^\ell + a_2$)
$(q, u) \leftarrow$ DivRem($r'\beta^\ell + a_1$, $2s'$)
$s \leftarrow s'\beta^\ell + q$
$r \leftarrow u\beta^\ell + a_0 - q^2$
if r < 0 then
    $r \leftarrow r + 2s - 1$
    $s \leftarrow s - 1$
return (s, r)



Theorem 1.5.1 Algorithm SqrtRem correctly returns the integer square
root s and remainder r of the input m, and has complexity R(2n) ∼ R(n) +
D(n) + S(n) where D(n) and S(n) are the complexities of the division with
remainder and squaring respectively. This gives $R(n) \sim \frac{1}{2} n^2$ with naive
multiplication, $R(n) \sim \frac{4}{3} K(n)$ with Karatsuba's multiplication, assuming
$S(n) \sim \frac{2}{3} M(n)$.



As an example, assume Algorithm SqrtRem is called on m = 123456789
with β = 10. One has n = 9, $\ell = 2$, $a_3 = 123$, $a_2 = 45$, $a_1 = 67$, and $a_0 = 89$.
The recursive call for $a_3\beta^\ell + a_2 = 12345$ yields s' = 111 and r' = 24. The
DivRem call yields q = 11 and u = 25, which gives s = 11111 and r = 2468.

Another nice way to compute the integer square root of an integer n,
i.e., $\lfloor n^{1/2} \rfloor$, is Algorithm SqrtInt, which is an all-integer version of Newton's
method (§4.2).

Algorithm 13 SqrtInt
Input: an integer n ≥ 1.
Output: $s = \lfloor n^{1/2} \rfloor$.
$u \leftarrow n$
repeat
    $s \leftarrow u$
    $t \leftarrow s + \lfloor n/s \rfloor$
    $u \leftarrow \lfloor t/2 \rfloor$
until u ≥ s
return s

Still with input 123456789, we successively get s = 61728395, 30864198,
15432100, 7716053, 3858034, 1929032, 964547, 482337, 241296, 120903, 60962,
31493, 17706, 12339, 11172, 11111, 11111. The convergence is slow because
the initial value of s is much too large. However, any initial value greater than or
equal to $n^{1/2}$ works (see the proof of Algorithm RootInt below): starting
from s = 12000, one gets s = 11144 then s = 11111.



1.5.2 k-th Root

The idea of algorithm SqrtRem for the integer square root can be
generalized to any power: if the current decomposition is $n = n'\beta^k + n''\beta^{k-1} + n'''$,
first compute a k-th root of n', say $n' = s^k + r$, then divide $r\beta + n''$ by $ks^{k-1}$
to get an approximation of the next root digit t, and correct it if needed.
Unfortunately the computation of the remainder, which is easy for the square
root, involves O(k) terms for the k-th root, and this method may be slower
than Newton's method with floating-point arithmetic (§4.2.3).

Similarly, algorithm SqrtInt can be generalized to the k-th root:

Algorithm 14 RootInt

Input: integers n ≥ 1, and k ≥ 2
Output: $s = \lfloor n^{1/k} \rfloor$

1: $u \leftarrow n$
2: repeat
3:     $s \leftarrow u$
4:     $t \leftarrow (k - 1)s + \lfloor n/s^{k-1} \rfloor$
5:     $u \leftarrow \lfloor t/k \rfloor$
6: until u ≥ s
7: return s

Theorem 1.5.2 Algorithm RootInt terminates and returns $\lfloor n^{1/k} \rfloor$.

Proof. As long as u < s in line 6, the sequence of s-values is decreasing, thus
it suffices to consider what happens when u ≥ s. First it is easy to see that
u ≥ s implies $n \ge s^k$. Consider now the function $f(t) := [(k - 1)t + n/t^{k-1}]/k$
for t > 0; its derivative is negative for $t < n^{1/k}$, and positive for $t > n^{1/k}$,
thus $f(t) \ge f(n^{1/k}) = n^{1/k}$. This proves that $s \ge \lfloor n^{1/k} \rfloor$. Together with
$s \le n^{1/k}$, this proves that $s = \lfloor n^{1/k} \rfloor$ at the end of the algorithm. □

Note that any initial value $\ge \lfloor n^{1/k} \rfloor$ works. This also proves the correctness
of Algorithm SqrtInt which is just the special case k = 2.
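
The integer Newton iteration translates almost line for line into Python. The sketch below covers both the square root (k = 2) and the general k-th root; the optional start argument is our addition, to show the effect of a better initial value.

```python
def root_int(n, k, start=None):
    """Return floor(n^(1/k)) for integers n >= 1 and k >= 2."""
    # Any initial value >= floor(n^(1/k)) works; n itself is the default.
    u = n if start is None else start
    while True:
        s = u
        t = (k - 1) * s + n // s**(k - 1)
        u = t // k
        if u >= s:          # the sequence stopped decreasing: s is the root
            return s


print(root_int(123456789, 2))                # default start, slow convergence
print(root_int(123456789, 2, start=12000))   # two iterations from a close start
print(root_int(117649, 3))
```

Starting from n costs a number of iterations proportional to the bit length of n; a start close to the root reduces this to a few steps, as the 12000 example in the text shows.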

1.5.3 Exact Root 

When a k-th root is known to be exact, there is of course no need to
compute exactly the final remainder in "exact root" algorithms, which saves
some computation time. However, one has to check that the remainder is
sufficiently small that the computed root is correct.

When a root is known to be exact, one may also try to compute it starting
from the least significant bits, as for exact division. Indeed, if $s^k = n$, then
$s^k = n \bmod \beta^\ell$ for any integer $\ell$. However, in the case of exact division, the
equation $a = qb \bmod \beta^\ell$ has only one solution q as soon as b is relatively
prime to β. Here, the equation $s^k = n \bmod \beta^\ell$ may have several solutions,
so the lifting process is not unique. For example, $x^2 = 1 \bmod 2^3$ has four
solutions 1, 3, 5, 7.

Suppose we have $s^k = n \bmod \beta^\ell$, and we want to lift to $\beta^{\ell+1}$. This implies
$(s + t\beta^\ell)^k = n + n'\beta^\ell \bmod \beta^{\ell+1}$ where $0 \le t, n' < \beta$. Expanding the left-hand
side to first order in $t\beta^\ell$ gives

$$k s^{k-1} t = n' + \frac{n - s^k}{\beta^\ell} \bmod \beta.$$

This equation has a unique solution t when $k s^{k-1}$ is relatively prime to β,
i.e., when both k and s are. For example we can extract cube roots of odd
integers in this way for β a power of two. When this condition holds, we can
also compute the root simultaneously from the most significant and least
significant ends, as for the exact division.

Unknown Exponent 

Assume now that one wants to check if a given integer n is an exact power,
without knowing the corresponding exponent. For example, many
factorization algorithms fail when given an exact power, therefore this case has to
be checked first. Algorithm IsPower detects exact powers, and returns the
largest corresponding exponent. To quickly detect non-k-th powers at step 2,
one may use modular algorithms when k is relatively prime to the base β
(see above).

Algorithm 15 IsPower

Input: a positive integer n.
Output: k ≥ 2 when n is an exact k-th power, 1 otherwise.

1: for k from $\lfloor \log_2 n \rfloor$ downto 2 do
2:     if n is a k-th power, return k
3: return 1.

Remark: in Algorithm IsPower, one can limit the search to prime
exponents k, but then the algorithm does not necessarily return the largest
exponent, and we might have to call it again. For example, taking n = 117649,
the algorithm will first return 3 because $117649 = 49^3$, and when called again
with 49 it will return 2.
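
A direct Python sketch of Algorithm IsPower follows. The k-th power test is done here by computing an integer k-th root with the Newton iteration of Algorithm RootInt and raising it back to the power k; this is one simple choice for illustration, not the modular test suggested above. The helper name iroot and its power-of-two starting value are assumptions of the sketch.

```python
def iroot(n, k):
    """floor(n^(1/k)) by the integer Newton iteration."""
    # 2^ceil(bits/k) is an upper bound on the root, hence a valid start.
    u = 1 << -(-n.bit_length() // k)
    while True:
        s = u
        u = ((k - 1) * s + n // s**(k - 1)) // k
        if u >= s:
            return s


def is_power(n):
    """Return the largest k >= 2 with n an exact k-th power, or 1 if none."""
    # n.bit_length() - 1 equals floor(log2(n)) for n >= 1.
    for k in range(n.bit_length() - 1, 1, -1):
        if iroot(n, k) ** k == n:
            return k
    return 1


print(is_power(117649), is_power(49), is_power(12))
```

Because all exponents are scanned in decreasing order, 117649 is reported directly as a 6th power (7^6), whereas the prime-exponent variant of the remark would need two calls.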

1.6 Greatest Common Divisor 

Many algorithms for computing gcds may be found in the literature. We can 
distinguish between the following (non-exclusive) types: 

• left-to-right (MSB) versus right-to-left (LSB) algorithms: in the former 
the actions depend on the most significant bits, while in the latter the 
actions depend on the least significant bits; 

• naive algorithms: these $O(n^2)$ algorithms consider one word of each
operand at a time, trying to guess from them the first quotients; we
count in this class algorithms considering double-size words, namely
Lehmer's algorithm and Sorenson's k-ary reduction in the left-to-right
and right-to-left cases respectively; algorithms not in that class consider
a number of words that depends on the input size n, and are often
subquadratic;

• subtraction-only algorithms: these algorithms trade divisions for sub- 
tractions, at the cost of more iterations; 

• plain versus extended algorithms: the former just compute the gcd of 
the inputs, while the latter express the gcd as a linear combination of 
the inputs. 

1.6.1 Naive GCD

For completeness we mention Euclid's algorithm for finding the gcd of two
non-negative integers u, v.

while v ≠ 0 do (u, v) ← (v, u mod v); return u.

Euclid's algorithm is discussed in many textbooks, e.g., [110], and we do not
recommend it in its simplest form, except for testing purposes. Indeed, it is
a slow way to compute a gcd, except for very small inputs.

Double-Digit Gcd. A first improvement comes from Lehmer's
observation: the first few quotients in Euclid's algorithm usually can be determined
from the two most significant words of the inputs. This avoids expensive
divisions that give small quotients most of the time (see [110, §4.5.3]). Consider
for example a = 427,419,669,081 and b = 321,110,693,270 with 3-digit
words. The first quotients are 1, 3, 48, ... Now if we consider the most
significant words, namely 427 and 321, we get the quotients 1, 3, 35, .... If we stop
after the first two quotients, we see that we can replace the initial inputs by
a − b and −3a + 4b, which gives 106,308,975,811 and 2,183,765,837.

Lehmer's algorithm determines cofactors from the most significant words
of the input integers. Those cofactors usually have size only half a word. The
DoubleDigitGcd algorithm (which should be called "double-word")
uses the two most significant words instead, which gives cofactors t, u, v, w of
one full word each. This is optimal for the computation of the four products
ta, ub, va, wb. With the above example, if we consider 427,419 and 321,110,
we find that the first five quotients agree, so we can replace a, b by −148a +
197b and 441a − 587b, i.e., 695,550,202 and 97,115,231.

Algorithm 16 DoubleDigitGcd

Input: $a := a_{n-1}\beta^{n-1} + \cdots + a_0$, $b := b_{m-1}\beta^{m-1} + \cdots + b_0$.
Output: gcd(a, b).

if b = 0 then return a
if m < 2 then return BasecaseGcd(a, b)
if a < b or n > m then return DoubleDigitGcd(b, a mod b)
$(t, u, v, w) \leftarrow$ HalfBezout($a_{n-1}\beta + a_{n-2}$, $b_{n-1}\beta + b_{n-2}$)
return DoubleDigitGcd(|ta + ub|, |va + wb|).

The subroutine HalfBezout takes as input two 2-word integers, performs
Euclid's algorithm until the smallest remainder fits in one word, and returns
the corresponding matrix [t, u; v, w].

Binary Gcd. A better algorithm than Euclid's, though still having $O(n^2)$
complexity, is the binary algorithm. It differs from Euclid's algorithm in two
ways: it considers least significant bits first, and it avoids divisions, except for
divisions by two (which can be implemented as shifts on a binary computer).

Algorithm 17 BinaryGcd
Input: a, b > 0.
Output: gcd(a, b).

$i \leftarrow 0$
while a mod 2 = b mod 2 = 0 do
    $(i, a, b) \leftarrow (i + 1, a/2, b/2)$
while a mod 2 = 0 do
    $a \leftarrow a/2$
while b mod 2 = 0 do
    $b \leftarrow b/2$    (now a and b are both odd)
while a ≠ b do
    $(a, b) \leftarrow (|a - b|, \min(a, b))$
    repeat $a \leftarrow a/2$ until a mod 2 ≠ 0
return $2^i \cdot a$.
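
Algorithm BinaryGcd maps directly onto integer operations. The following Python sketch follows the pseudocode step by step; the only liberty taken is that the repeat-until loop is written as a while-loop, which is equivalent here because |a − b| is even and nonzero whenever a and b are distinct odd numbers.

```python
def binary_gcd(a, b):
    """gcd of two positive integers using only subtractions and halvings."""
    i = 0
    # Remove the common power of two.
    while a % 2 == 0 and b % 2 == 0:
        i, a, b = i + 1, a // 2, b // 2
    # Make both operands odd; remaining factors of two are not common.
    while a % 2 == 0:
        a //= 2
    while b % 2 == 0:
        b //= 2
    while a != b:
        a, b = abs(a - b), min(a, b)
        while a % 2 == 0:        # a is even and nonzero at this point
            a //= 2
    return 2**i * a


print(binary_gcd(427419669081, 321110693270))
```

In a word-level implementation the halvings would be shifts, as the text notes.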



Sorenson's k-ary reduction

The binary algorithm is based on the fact that if a and b are both odd, then
a − b is even, and we can remove a factor of two since 2 does not divide
gcd(a, b). Sorenson's k-ary reduction is a generalization of that idea: given
a and b odd, we try to find small integers u, v such that ua − vb is divisible
by a large power of two.

Theorem 1.6.1 [174] If a, b > 0 and m > 1 with gcd(a, m) = gcd(b, m) = 1,
there exist u, v, $0 < |u|, v < \sqrt{m}$ such that ua = vb mod m.



Algorithm ReducedRatMod finds such a pair (u, v); it is a simple variation
of the extended Euclidean algorithm; indeed, the $u_i$ are denominators from
the continued fraction expansion of c/m.

Algorithm 18 ReducedRatMod

Input: a, b > 0, m > 1 with gcd(a, m) = gcd(b, m) = 1
Output: (u, v) such that $0 < |u|, v < \sqrt{m}$ and ua = vb mod m

1: $c \leftarrow a/b \bmod m$
2: $(u_1, v_1) \leftarrow (0, m)$
3: $(u_2, v_2) \leftarrow (1, c)$
4: while $v_2 \ge \sqrt{m}$ do
5:     $q \leftarrow \lfloor v_1/v_2 \rfloor$
6:     $(u_1, u_2) \leftarrow (u_2, u_1 - qu_2)$
7:     $(v_1, v_2) \leftarrow (v_2, v_1 - qv_2)$
8: return $(u_2, v_2)$.

When m is a prime power, the inversion 1/b mod m in the first line of Algorithm
ReducedRatMod can be performed efficiently using Hensel lifting.



Given two integers a, b of say n words, Algorithm ReducedRatMod
with $m = \beta^2$ will yield two integers u, v such that vb − ua is a multiple of
$\beta^2$. Since u, v have at most one word each, $a' = (vb - ua)/\beta^2$ has at most
n − 1 words (plus possibly one bit), therefore with b' = b mod a' one
obtains gcd(a, b) = gcd(a', b'), where both a' and b' have one word less. This
yields a LSB variant of the double-digit (MSB) algorithm.
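
Algorithm ReducedRatMod can be sketched in Python as follows. The sketch uses pow(b, -1, m) (Python 3.8+) for the modular inverse and compares squares instead of taking a square root, so that everything stays in exact integer arithmetic; both are implementation choices of the sketch rather than part of the algorithm as stated.

```python
def reduced_rat_mod(a, b, m):
    """Return (u, v) with 0 < |u|, v < sqrt(m) and u*a = v*b mod m."""
    c = a * pow(b, -1, m) % m      # c = a/b mod m, needs gcd(b, m) = 1
    u1, v1 = 0, m
    u2, v2 = 1, c
    # Invariant: u_i * c = v_i mod m for both pairs.
    while v2 * v2 >= m:            # same test as v2 >= sqrt(m)
        q = v1 // v2
        u1, u2 = u2, u1 - q * u2
        v1, v2 = v2, v1 - q * v2
    return u2, v2


a, b, m = 427419669081, 321110693271, 2**32
u, v = reduced_rat_mod(a, b, m)
print((u * a - v * b) % m == 0)
```

With m = β^2 the returned u and v fit in one word each, and (vb − ua)/β^2 is the shortened operand described above.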

1.6.2 Extended GCD 

Algorithm ExtendedGcd solves the extended greatest common divisor
problem: given two integers a and b, it computes their gcd g, and also two integers
u and v (called Bezout coefficients or sometimes cofactors or multipliers)
such that g = ua + vb. If $a_0$ and $b_0$ are the input numbers, and a, b the
current values, the following invariants hold at the start of each iteration of
the while loop and after the while loop: $a = ua_0 + vb_0$, and
$b = wa_0 + xb_0$.

An important special case is modular inversion (see Chapter |2J): given an 
integer n, one wants to compute 1/a mod n for a relatively prime to n. One 
then simply runs algorithm ExtendedGcd with input a and b = n: this 
yields u and v with ua + vn= 1, thus 1/a = w mod n. Since t> is not needed 
Algorithm 19 ExtendedGcd

Input: integers $a$ and $b$.
Output: integers $(g, u, v)$ such that $g = \gcd(a, b) = ua + vb$.

$(u, w) \leftarrow (1, 0)$
$(v, x) \leftarrow (0, 1)$
while $b \ne 0$ do
    $(q, r) \leftarrow \mathrm{DivRem}(a, b)$
    $(a, b) \leftarrow (b, r)$
    $(u, w) \leftarrow (w, u - qw)$
    $(v, x) \leftarrow (x, v - qx)$
return $(a, u, v)$.



here, we can simply avoid computing $v$ and $x$, by removing the two lines that
initialize and update $(v, x)$.

It may also be worthwhile to compute only $u$ in the general case, as the
cofactor $v$ can be recovered from $v = (g - ua)/b$; this division is exact (see

9123). 

All known algorithms for subquadratic gcd rely on an extended gcd sub- 
routine, so we discuss the subquadratic extended gcd in the next section. 
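
As an illustration, Algorithm ExtendedGcd and the modular-inversion special case can be sketched in Python. The function names `extended_gcd` and `mod_inverse` are ours, and nonnegative inputs are assumed; `mod_inverse` tracks only the pair $(u, w)$, as suggested above.

```python
def extended_gcd(a, b):
    """Return (g, u, v) with g = gcd(a, b) = u*a + v*b (a, b >= 0 assumed)."""
    u, w = 1, 0
    v, x = 0, 1
    while b != 0:
        q, r = divmod(a, b)
        a, b = b, r
        u, w = w, u - q * w
        v, x = x, v - q * x
    return a, u, v


def mod_inverse(a, n):
    """Return 1/a mod n, computing only the cofactor u."""
    g, b = a, n
    u, w = 1, 0
    while b != 0:
        q, r = divmod(g, b)
        g, b = b, r
        u, w = w, u - q * w
    if g != 1:
        raise ValueError("a is not invertible modulo n")
    return u % n


if __name__ == "__main__":
    g, u, v = extended_gcd(935, 714)
    print(g, u * 935 + v * 714)   # both values equal gcd(935, 714) = 17
    print(mod_inverse(3, 7))      # 5, since 3 * 5 = 15 ≡ 1 (mod 7)
```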

1.6.3 Half GCD, Divide and Conquer GCD 

Designing a subquadratic integer gcd algorithm that is both mathematically 
correct and efficient in practice appears to be quite a challenging problem. 

A first remark is that, starting from $n$-bit inputs, there are $O(n)$ terms in
the remainder sequence $r_0 = a, r_1 = b, \ldots, r_{i+1} = r_{i-1} \bmod r_i, \ldots$, and the
size of $r_i$ decreases linearly with $i$. Thus computing all the partial remainders
$r_i$ leads to a quadratic cost, and a fast algorithm should avoid this.

However, the partial quotients $q_i = r_{i-1} \operatorname{div} r_i$ are usually small: the main
idea is thus to compute them without computing the partial remainders. This
can be seen as a generalization of the DoubleDigitGcd algorithm: instead
of considering a fixed base $\beta$, adjust it so that the inputs have four "big
words". The cofactor matrix returned by the HalfBezout subroutine will
then reduce the input size to about $3n/4$. A second call with the remaining
two most significant "big words" of the new remainders will reduce their
size to half the input size. This gives rise to the HalfGcd algorithm. Note
that there are several possible definitions of the half-gcd. Given two $n$-bit
integers $a$ and $b$, HalfGcd returns two consecutive elements $a', b'$ of their
Algorithm 20 HalfGcd

Input: $a \ge b > 0$
Output: a $2 \times 2$ matrix $R$ and $a', b'$ such that $[a' \; b'] = R \, [a \; b]$

if $a$ is small, use ExtendedGcd.
$n \leftarrow \mathrm{nbits}(a)$, $k \leftarrow \lfloor n/2 \rfloor$
$a := a_1 2^k + a_0$, $b := b_1 2^k + b_0$
$S, a_2, b_2 \leftarrow \mathrm{HalfGcd}(a_1, b_1)$
$a' \leftarrow a_2 2^k + S_{11} a_0 + S_{12} b_0$
$b' \leftarrow b_2 2^k + S_{21} a_0 + S_{22} b_0$
$\ell \leftarrow \lfloor k/2 \rfloor$
$a' := a'_1 2^\ell + a'_0$, $b' := b'_1 2^\ell + b'_0$
$T, a'_2, b'_2 \leftarrow \mathrm{HalfGcd}(a'_1, b'_1)$
$a'' \leftarrow a'_2 2^\ell + T_{11} a'_0 + T_{12} b'_0$
$b'' \leftarrow b'_2 2^\ell + T_{21} a'_0 + T_{22} b'_0$
return $S \cdot T$, $a''$, $b''$.



remainder sequence with bit-size about $n/2$ (the different definitions differ
in the exact stopping criterion). In some cases, it is convenient also to return
the unimodular matrix $R$ such that $[a' \; b'] = R \, [a \; b]$, where $R$ has entries of
bit-size $n/2$.

Let $H(n)$ be the complexity of HalfGcd for inputs of $n$ bits: $a_1$ and $b_1$
have $n/2$ bits, thus the coefficients of $S$ and $a_2, b_2$ have $n/4$ bits. Thus $a', b'$
have $3n/4$ bits, $a'_1, b'_1$ have $n/2$ bits, $a'_0, b'_0$ have $n/4$ bits, the coefficients of $T$
and $a'_2, b'_2$ have $n/4$ bits, and $a'', b''$ have $n/2$ bits. We have
$H(n) \sim 2H(n/2) + 4M(n/4, n/2) + 4M(n/4) + 8M(n/4)$, i.e., $H(n) \sim 2H(n/2) + 20M(n/4)$. If
we do not need the final matrix $S \cdot T$, then we have $H^*(n) \sim H(n) - 8M(n/4)$.
For the plain gcd, which simply calls HalfGcd until $b$ is sufficiently small to
call a naive algorithm, the corresponding cost $G(n)$ satisfies
$G(n) = H^*(n) + G(n/2)$.
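
These recurrences can be solved in closed form when the multiplication cost is a pure power, $M(n) = n^\alpha$ with $\alpha > 1$. The following Python sketch (the function name `halfgcd_constants` and the pure-power cost model are our assumptions) computes the constants $h$, $h^*$, $g$ such that $H(n) \sim h M(n)$, $H^*(n) \sim h^* M(n)$ and $G(n) \sim g M(n)$; for naive multiplication ($\alpha = 2$) and Karatsuba ($\alpha = \log_2 3$) it reproduces the first two columns of Table 1.1.

```python
from math import log


def halfgcd_constants(alpha):
    """Constants (h, h_star, g) for M(n) = n**alpha, alpha > 1 (assumed model)."""
    quarter = 4.0 ** (-alpha)            # M(n/4) / M(n)
    half = 2.0 ** (-alpha)               # M(n/2) / M(n)
    h = 20 * quarter / (1 - 2 * half)    # from H(n) = 2 H(n/2) + 20 M(n/4)
    h_star = h - 8 * quarter             # from H*(n) = H(n) - 8 M(n/4)
    g = h_star / (1 - half)              # from G(n) = H*(n) + G(n/2)
    return h, h_star, g


if __name__ == "__main__":
    for name, alpha in [("naive", 2.0), ("Karatsuba", log(3, 2))]:
        h, h_star, g = halfgcd_constants(alpha)
        print(f"{name:10s} H = {h:.2f}  H* = {h_star:.2f}  G = {g:.2f}")
```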

An application of the half-gcd per se is the integer reconstruction problem.
Assume one wants to compute a rational $p/q$ where $p$ and $q$ are known to be
bounded by some constant $c$. Instead of computing with rationals, one may
perform all computations modulo some integer $n > c^2$. Hence one will end
up with $p/q \equiv m \bmod n$, and the problem is now to find the unknown $p$ and
$q$ from the known integer $m$. To do this, one starts an extended gcd from $m$
and $n$, and one stops as soon as the current $a$ and $u$ are smaller than $c$: since
we have $a = um + vn$, this gives $m \equiv a/u \bmod n$. This is exactly what is
           naive    Karatsuba    Toom-Cook    FFT
H(n)       2.5      6.67         9.52         $5 \log_2 n$
H*(n)      2.0      5.78         8.48         $5 \log_2 n$
G(n)       2.67     8.67         13.29        $10 \log_2 n$

Table 1.1: Cost of HalfGcd: H(n) with the cofactor matrix, H*(n) without
the cofactor matrix, G(n) the plain gcd, all in units of the multiplication cost
M(n), for naive multiplication, Karatsuba, Toom-Cook and FFT.

called a half-gcd; a subquadratic version is given above. 
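
The reconstruction procedure described above can be sketched with a plain (quadratic) extended gcd that stops early. This is only an illustration: the name `rational_reconstruct` is ours, the example takes $n > 2c^2$ so that the answer is unique (an assumption slightly stronger than the bound stated above), and the result is normalized to have $q > 0$.

```python
def rational_reconstruct(m, n, c):
    """Return (p, q) with p ≡ q*m (mod n), |p| <= c, 0 < q <= c, or None."""
    r0, r1 = n, m % n
    u0, u1 = 0, 1                  # invariant: r_i ≡ u_i * m (mod n)
    while r1 > c:                  # stop as soon as the remainder is small
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        u0, u1 = u1, u0 - q * u1
    if u1 == 0 or abs(u1) > c:
        return None                # no admissible fraction found
    if u1 < 0:                     # normalize the sign of the denominator
        r1, u1 = -r1, -u1
    return r1, u1


if __name__ == "__main__":
    n, c = 211, 10                       # 211 is prime and 211 > 2 * 10**2
    m = (3 * pow(7, -1, n)) % n          # image of 3/7 modulo n
    print(rational_reconstruct(m, n, c)) # (3, 7)
```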

Subquadratic Binary GCD 

The binary gcd can also be made fast, i.e., subquadratic in n. The basic idea 
is to mimic the left-to-right version, by defining an appropriate right-to-left 
division (Algorithm BinaryDivide). 

Algorithm 21 BinaryDivide

Input: $a, b \in \mathbb{Z}$ with $\nu(b) - \nu(a) = j > 0$
Output: $|q| < 2^j$ and $r = a + q 2^{-j} b$ such that $\nu(b) < \nu(r)$

$b' \leftarrow 2^{-j} b$
$q \leftarrow -a/b' \bmod 2^{j+1}$
if $q \ge 2^j$ then $q \leftarrow q - 2^{j+1}$
return $q$, $r = a + q 2^{-j} b$.

The integer $q$ is the binary quotient of $a$ and $b$, and $r$ is the binary
remainder.

This right-to-left division defines a right-to-left remainder sequence $a_0 = a$,
$a_1 = b$, \ldots, where $a_{i+1} = \mathrm{BinaryRemainder}(a_{i-1}, a_i)$, and
$\nu(a_i) < \nu(a_{i+1})$. It can be shown that this sequence eventually reaches $a_{i+1} = 0$ for
some index $i$. Assuming $\nu(a) = 0$, then $\gcd(a, b)$ is the odd part of $a_i$.
Indeed, in Algorithm BinaryDivide, if some odd prime divides both $a$ and
$b$, it certainly divides $2^{-j} b$ which is an integer, and thus it divides $a + q 2^{-j} b$.
Reciprocally, if some odd prime divides both $b$ and $r$, it divides also $2^{-j} b$,
thus it divides $a = r - q 2^{-j} b$; this shows that no spurious factor appears,
unlike in other gcd algorithms.
Example: let $a = 935$ and $b = 714$. We have $\nu(b) = \nu(a) + 1$, thus Algorithm
BinaryDivide computes $b' = 357$, $q = 1$, and $a_2 = a + q 2^{-j} b = 1292$. The
next step gives $a_3 = 1360$, then $a_4 = 1632$, $a_5 = 2176$, $a_6 = 0$. Since
$2176 = 2^7 \cdot 17$, we conclude that the gcd of 935 and 714 is 17.
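
The example can be replayed with a short Python sketch of the binary division and of the resulting quadratic binary gcd. The names `val2`, `binary_divide` and `binary_gcd` are ours. When $\nu(a) > 0$, the quotient $a/b'$ is taken after removing the common power of two from $a$ and $b'$, which is one possible reading of the division step; the gcd routine assumes $a$ odd and $b$ even and nonzero.

```python
def val2(x):
    """2-adic valuation of a nonzero integer."""
    return (x & -x).bit_length() - 1


def binary_divide(a, b):
    """Return (q, r) with |q| < 2^j, r = a + q * b / 2^j and val2(r) > val2(b)."""
    va = val2(a)
    j = val2(b) - va
    assert j > 0
    bp = b >> j                        # b' = 2^(-j) b has the same valuation as a
    mod = 1 << (j + 1)
    inv = pow((bp >> va) % mod, -1, mod)
    q = (-(a >> va) * inv) % mod       # q = -a/b' mod 2^(j+1)
    if q >= 1 << j:
        q -= mod
    return q, a + q * bp


def binary_gcd(a, b):
    """gcd(a, b) for odd a and even nonzero b, via binary remainders."""
    while b != 0:
        _, r = binary_divide(a, b)
        a, b = b, r
    a = abs(a)
    return a >> val2(a)                # odd part of the last nonzero term


if __name__ == "__main__":
    print(binary_divide(935, 714))     # (1, 1292)
    print(binary_gcd(935, 714))        # 17
```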

The corresponding gcd algorithm is quite similar to Algorithm HalfGcd,
except that it selects the least significant parts for the recursive calls, and uses
BinaryDivide instead of the classical division. See the references in §1.9
for more details.

1.7 Base Conversion 

Since computers usually work with binary numbers, and humans prefer decimal
representations, input/output base conversions are needed. In a typical
computation, there will be only a few conversions compared to the total number
of operations, thus optimizing conversions is less important. However,
when working with huge numbers, naive conversion algorithms (which several
software packages have) may slow down the whole computation.

In this section we consider that numbers are represented internally in
base $\beta$ (usually a power of 2) and externally in base $B$ (say a power of
10). When both bases are commensurable, i.e., both are powers of a common
integer, like 8 and 16, conversions of $n$-digit numbers can be performed in
$O(n)$ operations. We assume here that $\beta$ and $B$ are not commensurable.

One might think that only one algorithm is needed, since input and output
are symmetric by exchanging bases $\beta$ and $B$. Unfortunately, this is not true,
since computations are done in base $\beta$ only (see Ex. 1.30).

1.7.1 Quadratic Algorithms 

Algorithms IntegerInput and IntegerOutput respectively read and write
$n$-word integers, both with a complexity of $O(n^2)$.
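
A minimal Python sketch of the two quadratic routines follows. The names `integer_input`, `integer_output` and the digit alphabet `DIGITS` are ours, and Python's built-in integers stand in for the internal base-$\beta$ representation.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"   # assumed digit alphabet


def integer_input(s, B=10):
    """Read the string s (most significant digit first) in base B (2 <= B <= 36)."""
    A = 0
    for ch in s:
        A = B * A + int(ch, B)       # A <- B*A + val(s_i)
    return A


def integer_output(A, B=10):
    """Write the nonnegative integer A in base B (2 <= B <= 36)."""
    if A == 0:
        return "0"
    out = []
    while A != 0:
        A, d = divmod(A, B)          # digit d = A mod B, then A <- A div B
        out.append(DIGITS[d])
    return "".join(reversed(out))    # digits were produced least significant first


if __name__ == "__main__":
    print(integer_input("1234"))     # 1234
    print(integer_output(255, 16))   # ff
```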

1.7.2 Subquadratic Algorithms 

Fast conversion routines are obtained using a "divide and conquer" strategy.
For integer input, if the given string decomposes as $S = S_{hi} \,\|\, S_{lo}$ where $S_{lo}$
has $k$ digits in base $B$, then

$\mathrm{Input}(S, B) = \mathrm{Input}(S_{hi}, B) B^k + \mathrm{Input}(S_{lo}, B),$
Algorithm 22 IntegerInput

Input: a string $S = s_{m-1} \ldots s_1 s_0$ of digits in base $B$
Output: the value $A$ of the integer represented by $S$

$A \leftarrow 0$
for $i$ from $m - 1$ downto $0$ do
    $A \leftarrow BA + \mathrm{val}(s_i)$
return $A$.

Algorithm 23 IntegerOutput

Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$
Output: a string $S$ of characters, representing $A$ in base $B$

$m \leftarrow 0$
while $A \ne 0$ do
    $s_m \leftarrow \mathrm{char}(A \bmod B)$
    $A \leftarrow A \operatorname{div} B$
    $m \leftarrow m + 1$
return $S = s_{m-1} \ldots s_1 s_0$.

where $\mathrm{Input}(S, B)$ is the value obtained when reading the string $S$ in the
external base $B$. Algorithm FastIntegerInput shows a possible way to
implement this. If the output $A$ has $n$ words, Algorithm FastIntegerInput
has complexity $O(M(n) \log n)$, more precisely $\sim \frac{1}{2} M(n/2) \lg n$ for $n$ a power
of two in the FFT range (see Ex. 1.27).
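
The bottom-up pairing used by Algorithm FastIntegerInput can be sketched in Python as follows. The name `fast_integer_input` is ours, the input string is assumed non-empty, and Python's built-in integers hide the underlying multiplication algorithm, so the sketch shows the structure of the computation rather than its actual running time.

```python
def fast_integer_input(s, B=10):
    """Value of the non-empty digit string s in base B (2 <= B <= 36)."""
    l = [int(ch, B) for ch in reversed(s)]   # l[0] = val(s_0), least significant
    b, k = B, len(l)
    while k > 1:
        # Combine neighbours: l[i] + b * l[i+1] for i = 0, 2, 4, ...
        merged = [l[i] + b * l[i + 1] for i in range(0, k - 1, 2)]
        if k % 2 == 1:
            merged.append(l[k - 1])          # odd length: last entry is kept as is
        l = merged
        b, k = b * b, (k + 1) // 2           # (b, k) <- (b^2, ceil(k/2))
    return l[0]


if __name__ == "__main__":
    print(fast_integer_input("123456789"))   # 123456789
```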

For integer output, a similar algorithm can be designed, replacing
multiplications by divisions. Namely, if $A = A_{hi} B^k + A_{lo}$, then

$\mathrm{Output}(A, B) = \mathrm{Output}(A_{hi}, B) \,\|\, \mathrm{Output}(A_{lo}, B),$

where $\mathrm{Output}(A, B)$ is the string resulting from writing the integer $A$ in the
external base $B$, $S_1 \,\|\, S_0$ denotes the concatenation of $S_1$ and $S_0$, and it is
assumed that $\mathrm{Output}(A_{lo}, B)$ has exactly $k$ digits, after possibly padding
with leading zeros.

If the input $A$ has $n$ words, Algorithm FastIntegerOutput has complexity
$O(M(n) \log n)$, more precisely $\sim \frac{1}{2} D(n/2) \lg n$ for $n$ a power of two in the
FFT range, where $D(n/2)$ is the cost of dividing an $n$-word integer by an
$n/2$-word integer. Depending on the cost ratio between multiplication and
division, integer output may thus be 2 to 5 times slower than integer input;
see however Ex. 1.28.
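
A recursive Python sketch of the divide-and-conquer output routine follows. The name `fast_integer_output` and the alphabet `DIGITS` are ours, and $k$ is found by a naive search that a real implementation would replace by an estimate from the size of $A$.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"   # assumed digit alphabet


def fast_integer_output(A, B=10):
    """Base-B string of the nonnegative integer A (2 <= B <= 36)."""
    if A < B:
        return DIGITS[A]
    k = 1
    while B ** (2 * k) <= A:          # find k with B^(2k-2) <= A < B^(2k)
        k += 1
    Q, R = divmod(A, B ** k)
    r = fast_integer_output(R, B)
    # The low part must have exactly k digits, hence the zero padding.
    return fast_integer_output(Q, B) + "0" * (k - len(r)) + r


if __name__ == "__main__":
    print(fast_integer_output(1005))  # 1005
```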
Algorithm 24 FastIntegerInput

Input: a string $S = s_{m-1} \ldots s_1 s_0$ of digits in base $B$
Output: the value $A$ of the integer represented by $S$

$\ell \leftarrow [\mathrm{val}(s_0), \mathrm{val}(s_1), \ldots, \mathrm{val}(s_{m-1})]$
$(b, k) \leftarrow (B, m)$
while $k > 1$ do
    if $k$ even then $\ell \leftarrow [\ell_1 + b\ell_2, \ell_3 + b\ell_4, \ldots, \ell_{k-1} + b\ell_k]$
    else $\ell \leftarrow [\ell_1 + b\ell_2, \ell_3 + b\ell_4, \ldots, \ell_k]$
    $(b, k) \leftarrow (b^2, \lceil k/2 \rceil)$
return $\ell_1$.

Algorithm 25 FastIntegerOutput

Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$
Output: a string $S$ of characters, representing $A$ in base $B$

if $A < B$ then
    return $\mathrm{char}(A)$
else
    find $k$ such that $B^{2k-2} \le A < B^{2k}$
    $(Q, R) \leftarrow \mathrm{DivRem}(A, B^k)$
    $r \leftarrow \mathrm{FastIntegerOutput}(R)$
    return $\mathrm{FastIntegerOutput}(Q) \,\|\, 0^{k - \mathrm{len}(r)} \,\|\, r$
1.8 Exercises

Exercise 1.1 Extend the Kronecker-Schönhage trick from the beginning of §1.3
to negative coefficients, assuming the coefficients are in the range $[-p, p]$.

Exercise 1.2 (Harvey [92]) For multiplying two polynomials of degree less than
$n$, with non-negative integer coefficients bounded above by $p$, the Kronecker-
Schönhage trick performs one integer multiplication of size about $2n \lg p$, assuming
$n$ is small compared to $p$. Show that it is possible to perform two integer
multiplications of size $n \lg p$ instead, and even four integer multiplications of size
$\frac{1}{2} n \lg p$.

Exercise 1.3 Assume your processor provides an instruction $\mathrm{fmaa}(a, b, c, d)$
returning $h, l$ such that $ab + c + d = h\beta + l$ where $0 \le a, b, c, d, l, h < \beta$. Rewrite
Algorithm BasecaseMultiply using fmaa.

Exercise 1.4 (Hanrot) Prove that for $n_0 = 2$, the number $K(n)$ of word products
in Karatsuba's algorithm as defined in Th. 1.3.2 is non-decreasing (caution:
this is no longer true with a larger threshold, for example with $n_0 = 8$ we have
$K(7) = 49$ whereas $K(8) = 48$). Plot the graph of $K(n)/n^{\log_2 3}$ with a logarithmic scale
for $n$, for $2^7 \le n \le 2^{10}$, and find experimentally where the maximum appears.

Exercise 1.5 (Ryde) Assume the basecase multiply costs $M(n) = an^2 + bn$, and
that Karatsuba's algorithm costs $K(n) = 3K(n/2) + cn$. Show that dividing $a$ by
two increases the Karatsuba threshold $n_0$ by a factor of two, and on the contrary
decreasing $b$ and $c$ decreases $n_0$.

Exercise 1.6 (Maeder [121], Thomé) Show that an auxiliary memory of
$2n + o(n)$ words is enough to implement Karatsuba's algorithm in-place, for an $n \times n$
product. In the polynomial case, even prove that an auxiliary space of $n$ coefficients
is enough, in addition to the $n + n$ coefficients of the input polynomials, and the
$2n - 1$ coefficients of the product. [It is allowed to use the $2n$ result words, but
not to destroy the $n + n$ input words.]

Exercise 1.7 (Quercia, McLaughlin) Modify Alg. KaratsubaMultiply to use
only $\sim \frac{7}{2} n$ additions/subtractions. [Hint: decompose each of $C_0$, $C_1$ and $C_2$ into
two parts.]

Exercise 1.8 Design an in-place version of Algorithm KaratsubaMultiply (see
Ex. 1.6) that accumulates the result in $c_0, \ldots, c_{2n-1}$, and returns a carry bit.

Exercise 1.9 (Vuillemin [171]) Design a program or circuit to compute a $3 \times 2$
product in 4 multiplications. Then use it to perform a $6 \times 6$ product in 16
multiplications. How does this compare asymptotically with Karatsuba and Toom-
Cook 3-way?

Exercise 1.10 (Weimerskirch, Paar) Extend the Karatsuba trick to compute 
an n x n product in 1 -^ — - multiplications and — — ^~ — additions/subtractions. 
For which n does this win? 

Exercise 1.11 (Hanrot) In Algorithm OddEvenKaratsuba, in case both $m$
and $n$ are odd, one combines the larger parts $A_0$ and $B_0$ together, and the smaller
parts $A_1$ and $B_1$ together. Find a way to get instead:

K{m,n) = K(\m/2],[n/2\)+K([m/2\, \n/2\) + K{\m/2\, \n/2\). 

Exercise 1.12 Prove that if 5 integer evaluation points are used for Toom-Cook
3-way, the division by 3 cannot be avoided. Does this remain true if only 4 integer
points are used together with $\infty$?

Exercise 1.13 (Quercia, Harvey) In Toom-Cook 3-way, take as evaluation point
$2^w$ instead of 2 (§1.3.3), where $w$ is the number of bits per word (usually $w = 32$
or 64). Which division is then needed? Same question for the evaluation point
$2^{w/2}$.

Exercise 1.14 For multiplication of two numbers of size kn and n respectively, for 
an integer k > 2, show that the trivial strategy which performs k multiplications, 
each n x n, is not always the best possible. 

Exercise 1.15 (Karatsuba, Zuras [181]) Assuming the multiplication has
superlinear cost, show that the speedup of squaring with respect to multiplication
cannot exceed 2.

Exercise 1.16 (Thomé, Quercia) Multiplication and the middle product are
just special cases of linear forms programs: consider two sets of inputs $a_1, \ldots, a_n$
and $b_1, \ldots, b_m$, and a set of outputs $c_1, \ldots, c_k$ that are sums of products of $a_i b_j$.
For such a given problem, what is the least number of multiplies required? As an
example, can we compute $x = au + cw$, $y = av + bw$, $z = bu + cv$ in less than 6
multiplies? Same question for $x = au - cw$, $y = av - bw$, $z = bu - cv$.



Exercise 1.17 In Algorithm BasecaseDivRem (§1.4.1), prove that $q_j^* \le \beta + 1$.
Can this bound be reached? In the case $q_j^* \ge \beta$, prove that the while-loop at steps
9-11 is executed at most once. Prove that the same holds for Svoboda's algorithm
(§1.4.2), i.e., that $A \ge 0$ after the correction step.

Exercise 1.18 (Granlund, Möller) In Algorithm BasecaseDivRem, estimate
the probability that the test $A < 0$ succeeds, assuming the remainder $r_j$ from the
division of $a_{n+j}\beta + a_{n+j-1}$ by $b_{n-1}$ is uniformly distributed in $[0, b_{n-1} - 1]$,
$A \bmod \beta^{n+j-1}$ is uniformly distributed in $[0, \beta^{n+j-1} - 1]$, and $B \bmod \beta^{n-1}$ is uniformly
distributed in $[0, \beta^{n-1} - 1]$. Then replace the computation of $q_j^*$ by a division of the
three most significant words of $A$ by the two most significant words of $B$. Prove
the algorithm is still correct; what is the maximal number of corrections, and the
probability that $A < 0$ holds?

Exercise 1.19 (Montgomery [132]) Let $0 < b < \beta$, and $0 \le a_4, \ldots, a_0 < \beta$.
Prove that $a_4(\beta^4 \bmod b) + \cdots + a_1(\beta \bmod b) + a_0 < \beta^2$, as long as $b < \beta/3$. Use
that fact to design an efficient algorithm dividing $A = a_{n-1}\beta^{n-1} + \cdots + a_0$ by $b$.
Does that algorithm extend to division by the least significant digits?

Exercise 1.20 In Algorithm RecursiveDivRem, find inputs that require 1, 2,
3 or 4 corrections [Hint: consider $\beta = 2$]. Prove that when $n = m$ and
$A < \beta^m (B + 1)$, at most two corrections occur.

Exercise 1.21 Find the complexity of Algorithm RecursiveDivRem in the
FFT range.

Exercise 1.22 Consider the division of A of kn words by B of n words, with 
integer k > 3, and the alternate strategy that consists in extending the divisor with 
zeros so that it has half the size of the dividend. Show this is always slower than 
Algorithm UnbalancedDivision [assuming the division has superlinear cost]. 

Exercise 1.23 An important special case of division is when the divisor is of the
form $b^k$. This is useful for example for the output routine (§1.7). Can one design
a fast algorithm for that case?

Exercise 1.24 (Sedoglavic) Does the Kronecker-Schönhage trick to reduce
polynomial multiplication to integer multiplication (§1.3) also work, in an efficient
way, for division? Assume for example you want to divide a degree-$2n$ polynomial
$A(x)$ by a monic degree-$n$ polynomial $B(x)$, both having integer coefficients
bounded by $p$.

Exercise 1.25 Design an algorithm that performs an exact division of a $4n$-bit
integer by a $2n$-bit integer, with a quotient of $2n$ bits, using the idea from the last
paragraph of §1.4.5. Prove that your algorithm is correct.

Exercise 1.26 (Shoup) Show that in ExtendedGcd, if $a > b > 0$, and
$g = \gcd(a, b)$, then the cofactor $u$ satisfies $-b/(2g) < u \le b/(2g)$.




Exercise 1.27 Find a formula $T(n)$ for the asymptotic complexity of Algorithm
FastIntegerInput when $n = 2^k$ (§1.7.2), and show that, for general $n$, it is within
a factor of two of $T(n)$. [Hint: consider the binary expansion of $n$.] Design another
subquadratic algorithm that works top-down: is it faster?

Exercise 1.28 Show that asymptotically, the output routine can be made as fast
as the input routine FastIntegerInput. [Hint: use Bernstein's scaled remainder
tree and the middle product.] Experiment with it on your favorite multiple-
precision software.

Exercise 1.29 If the internal base $\beta$ and the external one $B$ share a common
divisor (as in the case $\beta = 2^\ell$ and $B = 10$), show how one can exploit this to
speed up the subquadratic input and output routines.

Exercise 1.30 Assume you are given two $n$-digit integers in base 10, but you
have fast arithmetic in base 2 only. Can you multiply them in $O(M(n))$?



1.9 Notes and References 

"Online" algorithms are considered in many books and papers, see for example 
the book by Borodin and El-Yaniv (2^]. "Relaxed" algorithms were introduced by 
van der Hoeven. For references and a discussion of the differences between "lazy",
"zealous" and "relaxed" algorithms, see [169].

An example of implementation with "guard bits" to avoid overflow problems in
the addition (§1.2) is the block-wise modular arithmetic from Lenstra and Dixon
on the MasPar [72], where they used $\beta = 2^{30}$ with 32-bit words.

The fact that polynomial multiplication reduces to integer multiplication is
attributed to Kronecker, and was rediscovered by Schönhage [150]. Nice applications
of the Kronecker-Schönhage trick are given in [158]. Very little is known about
the average complexity of Karatsuba's algorithm. What is clear is that no simple
asymptotic equivalent can be obtained, since the ratio $K(n)/n^\alpha$ does not converge.
See Ex. 1.4.

A very good description of Toom-Cook algorithms can be found in [66, Section
9.5.1], in particular how to symbolically generate the evaluation and interpolation
formulae. Bodrato and Zanoni show that the Toom-Cook 3-way interpolation
scheme from §1.3.3 is close to optimal for the points 0, 1, $-1$, 2, $\infty$; they also
exhibit efficient 4-way and 5-way schemes [23].

The odd-even scheme is described in [90], and was independently discovered
by Andreas Enge.
The exact division algorithm starting from least significant bits is due to
Jebelean [101], who also invented with Krandick the "bidirectional" algorithm [111].
The Karp-Markstein trick to speed up Newton's iteration (or Hensel lifting over
$p$-adic numbers) is described in [107]. The "recursive division" in §1.4.3 is from
[52], although previous but not-so-detailed ideas can be found in [128] and [103].
The definition of Hensel's division used here is due to Shand and Vuillemin [154],
who also point out the duality with Euclidean division.

Algorithm SqrtRem (§1.5.1) was first described in [180], and proven in [22].
Algorithm Sqrtlnt is described in [HOI; its generalization to k-th roots (Algorithm 
Rootlnt) is due to Keith Briggs. The detection of exact powers is discussed in 
EU and especially in Q3IIISI- 

The classical (quadratic) Euclidean algorithm has been considered by many
authors; a good reference is Knuth [110]. The binary gcd is almost as old as the
classical Euclidean algorithm: Knuth [110] has traced it back to a first-century
AD Chinese text, though it was rediscovered several times in the 20th century.
The binary gcd has been analysed by Brent [35, 41], Knuth [110], Maze [122]
and Vallée [168]. A parallel (systolic) version that runs in $O(n)$ time using $O(n)$
processors was given by Brent and Kung [46].

The double-digit gcd is due to Jebelean [102]. The $k$-ary gcd reduction is due
to Sorenson [157], and was improved and implemented in GNU MP by Weber;
Weber also invented Algorithm ReducedRatMod [174], which is inspired by
previous work of Wang.

The first subquadratic gcd algorithm was published by Knuth [109], but his
analysis was suboptimal (he gave $O(n (\log n)^5 (\log \log n))$), and the correct
complexity was given by Schönhage [148]: for this reason the algorithm is sometimes
called the Knuth-Schönhage algorithm. A description for the polynomial case can
be found in [2], and a detailed but incorrect one for the integer case in [179]. The
subquadratic binary gcd given here is due to Stehlé and Zimmermann [160].



Chapter 2 

The FFT and Modular Arithmetic



In this Chapter our main topic is modular arithmetic, i.e., how
to compute efficiently modulo a given integer $N$. In most
applications, the modulus $N$ is fixed, and special-purpose algorithms
benefit from some precomputations depending on $N$ only, that
speed up arithmetic modulo $N$.

In this Chapter, unless explicitly stated, we consider that the
modulus $N$ occupies $n$ words in the word-base $\beta$, i.e.,
$\beta^{n-1} \le N < \beta^n$.

2.1 Representation 

We consider in this section the different possible representations of residues
modulo $N$. As in Chapter 1 we consider mainly dense representations.

2.1.1 Classical Representation

The classical representation stores a residue (class) $a$ as an integer $0 \le a < N$.
Residues are thus always fully reduced, i.e., in canonical form.

Another non-redundant form consists in choosing a symmetric representation,
say $-N/2 \le a < N/2$. This form might save some reductions in
additions or subtractions (see §2.2). Negative numbers might be stored either
with a separate sign, or with a two's-complement representation.

Since $N$ takes $n$ words in base $\beta$, an alternative redundant representation
consists in choosing $0 \le a < \beta^n$ to represent a residue class. If the underlying
arithmetic is word-based, this will yield no slowdown compared to the
canonical form. An advantage of this representation is that when adding
two residues, it suffices to compare their sum with $\beta^n$, and the result of this
comparison is simply given by the carry bit of the addition (see Algorithm
IntegerAddition in §1.2), instead of comparing the sum with $N$.

2.1.2 Montgomery's Form 

Montgomery's form is another representation widely used when several modular
operations have to be performed modulo the same integer $N$ (additions,
subtractions, modular multiplications). It implies a small overhead to convert
(if needed) from the classical representation to Montgomery's and
vice versa, but this overhead is often more than compensated by the speedup
obtained in the modular multiplication.

The main idea is to represent a residue $a$ by $a' = aR \bmod N$, where
$R = \beta^n$, and $N$ takes $n$ words in base $\beta$. Thus Montgomery is not concerned with
the physical representation of a residue class, but with the meaning associated
to a given physical representation. (As a consequence, the different choices
mentioned above for the physical representation are all possible.) Addition
and subtraction are unchanged, but (modular) multiplication translates to a
different, much simpler, algorithm (§2.3.2).

In most applications using Montgomery's form, all inputs are first converted
to Montgomery's form, using $a' = aR \bmod N$, then all computations
are performed in Montgomery's form, and finally all outputs are converted
back (if needed) to the classical form, using $a = a'/R \bmod N$. We need
to assume that $(R, N) = 1$, or equivalently that $(\beta, N) = 1$, to ensure the
existence of $1/R \bmod N$. This is not usually a problem because $\beta$ is a power
of two and $N$ can be assumed to be odd.



2.1.3 Residue Number Systems 

In a Residue Number System, a residue $a$ is represented by a list of residues
$a_i$ modulo $N_i$, where the moduli $N_i$ are coprime and their product is $N$. The
integers $a_i$ can be efficiently computed from $a$ using a remainder tree, and
the unique integer $0 \le a < N = N_1 N_2 \cdots$ is computed from the $a_i$ by an
Explicit Chinese Remainder Theorem (§2.6). The residue number system is
classical (MSB)        p-adic (LSB)
Euclidean division     Hensel division, Montgomery reduction
Svoboda's algorithm    Montgomery-Svoboda
Euclidean gcd          binary gcd
Newton's method        Hensel lifting

Figure 2.1: Equivalence between LSB and MSB algorithms.

interesting since addition and multiplication can be performed in parallel on
each small residue $a_i$. However this representation requires that $N$ factors
into convenient moduli $N_1, N_2, \ldots$, which is not always the case (see however

gZSD- 
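
The following Python sketch illustrates a residue number system with three small moduli. The names `to_rns` and `from_rns` are ours; the conversion to residues uses plain remainders rather than a remainder tree, and the reconstruction uses the textbook Chinese Remainder formula rather than the explicit variant mentioned above.

```python
from math import prod


def to_rns(a, moduli):
    """Residues of a modulo each (pairwise coprime) modulus."""
    return [a % Ni for Ni in moduli]


def from_rns(residues, moduli):
    """Unique 0 <= a < N = prod(moduli) with a ≡ residues[i] (mod moduli[i])."""
    N = prod(moduli)
    a = 0
    for ai, Ni in zip(residues, moduli):
        Mi = N // Ni                       # product of the other moduli
        a += ai * Mi * pow(Mi, -1, Ni)     # contributes ai mod Ni, 0 mod the others
    return a % N


if __name__ == "__main__":
    moduli = [7, 11, 13]                   # N = 1001
    x, y = to_rns(123, moduli), to_rns(45, moduli)
    # Multiplication is performed independently on each small residue.
    z = [(xi * yi) % Ni for xi, yi, Ni in zip(x, y, moduli)]
    print(from_rns(z, moduli), (123 * 45) % 1001)   # the two values agree
```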

2.1.4 MSB vs LSB Algorithms 

Many classical (most significant bits first or MSB) algorithms have a $p$-adic
(least significant bits first or LSB) equivalent form. Thus several algorithms
in this Chapter are just LSB-variants of algorithms discussed in Chapter 1
(see Fig. 2.1).

2.1.5 Link with Polynomials 

As in Chapter 1, a strong link exists between modular arithmetic and arithmetic
on polynomials. One way of implementing finite fields $\mathbb{F}_q$ with $q = p^n$
is to work with polynomials in $\mathbb{F}_p[x]$, which are reduced modulo a monic
irreducible polynomial $f(x) \in \mathbb{F}_p[x]$ of degree $n$. In this case modular reduction
happens both at the coefficient level (i.e., in $\mathbb{F}_p$) and at the polynomial
level, i.e., modulo $f(x)$.

Some algorithms work in the ring $(\mathbb{Z}/n\mathbb{Z})[x]$, where $n$ is a composite
integer. An important case is Schönhage-Strassen's multiplication algorithm,
where $n$ has the form $2^\ell + 1$.

In both cases, $\mathbb{F}_p[x]$ and $(\mathbb{Z}/n\mathbb{Z})[x]$, the Kronecker-Schönhage trick
(§1.3) can be applied efficiently. Indeed, since the coefficients are known to be
bounded, by $p$ and $n$ respectively, and thus have a fixed size, the segmentation
is quite efficient. If polynomials have degree $d$ and coefficients are bounded
by $n$, one obtains $O(M(d \log n))$ operations, instead of $M(d) M(\log n)$ with
the classical approach. Also, the implementation is simpler, because we only
have to implement fast arithmetic for large integers instead of fast arithmetic
at both the polynomial level and the coefficient level (see also Ex. 1.2).

2.2 Addition and Subtraction 

The addition of two residues in classical representation is done as follows:

Algorithm 26 ModularAdd

Input: residues $a, b$ with $0 \le a, b < N$.
Output: $c = a + b \bmod N$.

$c \leftarrow a + b$
if $c \ge N$ then
    $c \leftarrow c - N$

Assuming $a$ and $b$ are uniformly distributed in $[0, N - 1]$, the subtraction
$c \leftarrow c - N$ is performed with probability $(1 - 1/N)/2$. If we use instead a
symmetric representation in $[-N/2, N/2)$, the probability that we need to
add or subtract $N$ drops to $1/4 + O(1/N^2)$, at the cost of an additional test.
This extra test might be expensive for small $N$ (say one or two words)
but will be relatively cheap as long as $N$ uses say a dozen words.
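
As a quick check of the probability $(1 - 1/N)/2$, the following Python sketch implements Algorithm ModularAdd and counts exactly, over all pairs of residues, how often the subtraction is performed. The names `modular_add` and `subtraction_frequency` are ours, and the exhaustive count is only practical for small $N$.

```python
def modular_add(a, b, N):
    """c = a + b mod N for residues 0 <= a, b < N."""
    c = a + b
    if c >= N:
        c -= N
    return c


def subtraction_frequency(N):
    """Exact fraction of pairs (a, b) in [0, N-1]^2 that trigger c <- c - N."""
    count = sum(1 for a in range(N) for b in range(N) if a + b >= N)
    return count / (N * N)


if __name__ == "__main__":
    N = 101
    print(modular_add(5, 99, N))                         # 3
    print(subtraction_frequency(N), (1 - 1 / N) / 2)     # the two values agree
```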

2.3 Multiplication 

Modular multiplication consists in computing $A \cdot B \bmod N$, where $A$ and $B$
are residues modulo $N$, which has $n$ words in base $\beta$. We assume here that $A$
and $B$ have at most $n$ words, and in some cases that they are fully reduced,
i.e., $0 \le A, B < N$.

The naive algorithm first computes $C = AB$, which has at most $2n$ words,
and then reduces $C$ modulo $N$. When the modulus $N$ is not fixed, the best
known algorithms are those presented in Chapter 1 (§1.4.6). We consider here
better algorithms that benefit from an invariant modulus. These algorithms
are allowed to perform precomputations depending on $N$ only. The cost of the
precomputations is not taken into account: it is assumed negligible for many
modular reductions. However, we assume that the amount of precomputed
data uses only linear space, i.e., $O(\log N)$ memory.

Algorithms with precomputations include Barrett's algorithm (§2.3.1),
which computes an approximation of the inverse of the modulus, thus trading
division for multiplication; Montgomery's algorithm, which corresponds to
Hensel's division with remainder only (§1.4.8), and its subquadratic variant,
which is the LSB variant of Barrett's algorithm; and finally McLaughlin's
algorithm ( XZXty . 

2.3.1 Barrett's Algorithm 



Barrett's algorithm [TT] is interesting when many divisions have to be made 
with the same divisor; this is the case when one performs computations mod- 
ulo a fixed integer. The idea is to precompute an approximation of the divisor 
inverse. In such a way, an approximation of the quotient is obtained with just 
one multiplication, and the corresponding remainder after a second multipli- 
cation. A small number of corrections suffice to convert those approximations 
into exact values. For the sake of simplicity, we describe Barrett's algorithm in
base $\beta$; however, $\beta$ might be replaced by any integer, in particular $2^n$ or $\beta^n$.

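Algorithm 27 BarrettDivRem

Input: integers $A$, $B$ with $0 \le A < \beta^2$, $\beta/2 < B < \beta$.
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$.

1: $I \leftarrow \lfloor \beta^2/B \rfloor$    (precomputation)
2: $Q \leftarrow \lfloor A_1 I/\beta \rfloor$ where $A = A_1 \beta + A_0$ with $0 \le A_0 < \beta$
3: $R \leftarrow A - QB$
4: while $R \ge B$ do
5:     $(Q, R) \leftarrow (Q + 1, R - B)$
6: return $(Q, R)$.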





Theorem 2.3.1 Algorithm BarrettDivRem is correct and step 5 is performed
at most 3 times.

Proof. Since $A = QB + R$ is invariant in the algorithm, we just need to prove
that $0 \le R < B$ at the end. We first consider the value of $Q$, $R$ before the
while-loop. Since $\beta/2 < B < \beta$, we have $\beta < \beta^2/B < 2\beta$, thus $\beta \le I < 2\beta$.
We have $Q \le A_1 I/\beta \le A_1 \beta/B \le A/B$. This ensures that $R$ is nonnegative.
Now $I > \beta^2/B - 1$, which gives

$IB > \beta^2 - B.$ (2.1)

Similarly, $Q > A_1 I/\beta - 1$ gives

$\beta Q > A_1 I - \beta.$ (2.2)

This yields $\beta QB > A_1 IB - \beta B > A_1(\beta^2 - B) - \beta B = \beta(A - A_0) - B(\beta + A_1) >
\beta A - 4\beta B$ since $A_0 < \beta < 2B$ and $A_1 < \beta$. We conclude that $A < B(Q + 4)$,
thus at most 3 corrections are needed. □

The bound of 3 corrections is tight: it is obtained for $A = 1980$, $B = 36$,
with $\beta = 2^n$ and $n = 6$. For this example $I = 113$, $A_1 = 30$, $Q = 52$, $R = 108 = 3B$.

The multiplications at steps 2 and 3 may be replaced by short products,
more precisely that of step 2 by a high short product, and that of step 3 by
a low short product (see §3.3).

Complexity of Barrett's Algorithm 

If the multiplications at steps 2 and 3 are performed using full products,
Barrett's algorithm costs $2M(n)$ for a divisor of size $n$. In the FFT range, this
cost might be lowered to $1.5M(n)$ using the "wrap-around trick" (§3.4.1);
moreover, if the forward transforms of $I$ and $B$ are stored, the cost decreases
to $M(n)$, assuming one FFT multiplication equals three transforms.
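
Algorithm BarrettDivRem can be sketched in Python as follows, using full products at steps 2 and 3. The names `barrett_precompute` and `barrett_divrem` are ours, and the number of corrections is returned only to illustrate Theorem 2.3.1.

```python
def barrett_precompute(B, beta):
    """Step 1: I = floor(beta^2 / B), computed once per divisor B."""
    return beta * beta // B


def barrett_divrem(A, B, beta, I):
    """Return (Q, R, corrections) for 0 <= A < beta^2 and beta/2 < B < beta."""
    A1 = A // beta                 # A = A1 * beta + A0
    Q = (A1 * I) // beta           # step 2: approximate quotient
    R = A - Q * B                  # step 3: corresponding remainder
    corrections = 0
    while R >= B:                  # steps 4-5: at most 3 iterations
        Q, R = Q + 1, R - B
        corrections += 1
    return Q, R, corrections


if __name__ == "__main__":
    beta = 64
    I = barrett_precompute(36, beta)
    print(I, barrett_divrem(1980, 36, beta, I))   # 113 (55, 0, 3)
```

The printed example is the tight case quoted above: the approximate quotient is 52 and three corrections bring it to $1980/36 = 55$.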

2.3.2 Montgomery's Multiplication 

Montgomery's algorithm is very efficient for performing modular arithmetic
modulo a fixed modulus $N$. The main idea is to replace a residue $A \bmod N$
by $A' = \lambda A \bmod N$, where $A'$ is the "Montgomery form" corresponding to
the residue $A$. Addition and subtraction are unchanged, since
$\lambda A + \lambda B \equiv \lambda(A + B) \bmod N$. The multiplication of two residues in Montgomery form
does not give exactly what we want: $(\lambda A)(\lambda B) \not\equiv \lambda(AB) \bmod N$. The trick
is to replace the classical modular multiplication by "Montgomery's
multiplication":

$\mathrm{MontgomeryMul}(A', B') = \frac{A'B'}{\lambda} \bmod N.$

For some values of $\lambda$, $\mathrm{MontgomeryMul}(A', B')$ can be easily computed,
in particular for $\lambda = \beta^n$, where $N$ uses $n$ words in base $\beta$. Algorithm 28
presents a quadratic algorithm (REDC) to compute $\mathrm{MontgomeryMul}(A', B')$
in that case, and a subquadratic reduction is given in Algorithm FastREDC.
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Another view of Montgomery's algorithm for A = j3 n is to consider that 
it computes the remainder of Hensel's division ( fll.4.81) . 

Algorithm 28 REDC (quadratic non-interleaved version). The $c_i$ form the
current base-$\beta$ decomposition of $C$, i.e., the $c_i$ are defined by $C = \sum_{i=0}^{2n-1} c_i \beta^i$.

Input: $0 \le C < \beta^{2n}$, $N < \beta^n$, $\mu \leftarrow -N^{-1} \bmod \beta$, $(\beta, N) = 1$
Output: $0 \le R < \beta^n$ such that $R = C\beta^{-n} \bmod N$
1: for $i$ from 0 to $n - 1$ do
2:     $q_i \leftarrow \mu c_i \bmod \beta$
3:     $C \leftarrow C + q_i N \beta^i$
4: $R \leftarrow C\beta^{-n}$
5: if $R \ge \beta^n$ then return $R - N$ else return $R$.



Theorem 2.3.2 Algorithm REDC is correct. 

Proof. We first prove that $R \equiv C\beta^{-n} \bmod N$: $C$ is only modified in line 3,
which does not change $C \bmod N$, thus at line 4 we have $R \equiv C\beta^{-n} \bmod N$,
which remains true in the last line.

Assume that for a given $i$, we have $C \equiv 0 \bmod \beta^i$ when entering step 2.
Then since $q_i \equiv -c_i/N \bmod \beta$, we have $C + q_i N \beta^i \equiv 0 \bmod \beta^{i+1}$ at the next
step, so the new value of $c_i$ is 0. Thus when one exits the for-loop, $C$ is a
multiple of $\beta^n$, thus $R$ is an integer at step 4.

Still at step 4 we have $C < \beta^{2n} + (\beta - 1)N(1 + \beta + \cdots + \beta^{n-1}) =
\beta^{2n} + N(\beta^n - 1)$, thus $R < \beta^n + N$, and $R - N < \beta^n$. □

Compared to classical division (Algorithm BasecaseDivRem, §1.4.1),
Montgomery's algorithm has two significant advantages: the quotient
selection is performed by a multiplication modulo the word base $\beta$, which is
more efficient than a division by the most significant word $b_{n-1}$ of the divisor
as in BasecaseDivRem; and there is no repair step inside the for-loop, only
one at the very end.

For example, with inputs $C = 766970544842443844$, $N = 862664913$, and
$\beta = 1000$, Algorithm REDC precomputes $\mu = 23$, then we have $q_0 = 412$,
which yields $C \leftarrow C + 412N = 766970900260388000$; then $q_1 = 924$, which
yields $C \leftarrow C + 924N\beta = 767768002640000000$; then $q_2 = 720$, which yields
$C \leftarrow C + 720N\beta^2 = 1388886740000000000$. At step 4, $R = 1388886740$, and
since $R \ge \beta^3$, one returns $R - N = 526221827$.
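
The worked example can be replayed with a word-by-word REDC on 64-bit integers. This is only a sketch for toy sizes: the names `redc_mu` and `redc` are hypothetical, $\mu$ is found by exhaustive search instead of a proper modular inversion, and every intermediate value is assumed to fit in 64 bits (true for the example, where $\beta = 1000$ and $n = 3$).

```c
#include <assert.h>
#include <stdint.h>

/* mu = -1/N mod beta, found by exhaustive search (fine for a small beta). */
uint64_t redc_mu(uint64_t N, uint64_t beta)
{
    for (uint64_t mu = 1; mu < beta; mu++)
        if ((mu * (N % beta)) % beta == beta - 1)
            return mu;
    assert(!"N must be coprime to beta");
    return 0;
}

/* Word-by-word Montgomery reduction: returns a value congruent to
   C * beta^(-n) mod N and smaller than beta^n. */
uint64_t redc(uint64_t C, uint64_t N, uint64_t beta, unsigned n)
{
    uint64_t mu = redc_mu(N, beta);          /* a precomputation in practice */
    uint64_t weight = 1;                     /* beta^i */
    for (unsigned i = 0; i < n; i++) {
        uint64_t ci = (C / weight) % beta;   /* current word of weight beta^i */
        uint64_t qi = (mu * ci) % beta;      /* quotient selection (step 2) */
        C += qi * N * weight;                /* step 3: clears word i of C */
        weight *= beta;
    }
    uint64_t R = C / weight;                 /* step 4: exact division by beta^n */
    return (R >= weight) ? R - N : R;        /* step 5: single final repair */
}
```

With the inputs of the example, `redc(766970544842443844ULL, 862664913ULL, 1000, 3)` follows the same quotients 412, 924, 720 and returns 526221827.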
Since Montgomery's algorithm (i.e., Hensel's division with remainder
only) can be viewed as an LSB variant of the classical division, Svoboda's
divisor preconditioning (§1.4.2) also translates to the LSB context. More
precisely, in Algorithm REDC, one wants to modify the divisor $N$ so that
the "quotient selection" $q \leftarrow \mu c_i \bmod \beta$ at step 2 becomes trivial. A natural
choice is to have $\mu = 1$, which corresponds to $N \equiv -1 \bmod \beta$. The multiplier
$k$ used in Svoboda division is thus here simply the parameter $\mu$ in REDC.
The Montgomery-Svoboda algorithm obtained works as follows:

1. first compute $N' = \mu N$, with $N' < \beta^{n+1}$;

2. then perform the $n - 1$ first loops of REDC, replacing $\mu$ by 1, and $N$
   by $N'$;

3. perform a last classical loop with $\mu$ and $N$, and the last steps from
   REDC.

The quotient selection with Montgomery-Svoboda's algorithm thus simply
consists in "reading" the word of weight $\beta^i$ in the current value of $C$.

For the above example, one gets $N' = 19841292999$; $q_0$ is the least
significant word of $C$, i.e., $q_0 = 844$, then $C \leftarrow C + 844N' = 766987290893735000$;
then $q_1 = 735$ and $C \leftarrow C + 735N'\beta = 781570641248000000$; the last step
gives $q_2 = 704$, and $C \leftarrow C + 704N\beta^2 = 1388886740000000000$, which is the
same value that we found above after the for-loop.



Subquadratic Montgomery Reduction 

A subquadratic version of REDC is obtained by taking $n = 1$ in Algorithm
REDC, and considering $\beta$ as a "giant base" (alternatively, replace $\beta$ by $\beta^n$
below):

Algorithm 29 FastREDC (Subquadratic Montgomery multiplication)

Input: $0 \le C < \beta^2$, $N < \beta$, $\mu \leftarrow -1/N \bmod \beta$
Output: $0 \le R < \beta$ such that $R = C/\beta \bmod N$
1: $Q \leftarrow \mu C \bmod \beta$
2: $R \leftarrow (C + QN)/\beta$
3: if $R \ge \beta$ then return $R - N$ else return $R$.

This is exactly the 2-adic counterpart of Barrett's subquadratic algorithm:
steps 1 and 2 might be performed by a low short product and a high short
product respectively.

When combined with Karatsuba's multiplication, assuming the products
of steps 1 and 2 are full products, the reduction requires 2 multiplications of
size $n$, i.e., 6 multiplications of size $n/2$ ($n$ denotes the size of $N$, $\beta$ being a
giant base).

With some additional precomputation, the reduction might be performed
with 5 multiplications of size $n/2$, assuming $n$ is even. This is simply
Montgomery-Svoboda's algorithm with $N$ having two big words in base $\beta^{n/2}$:

Algorithm 30 MontgomerySvoboda2

Input: $0 \le C < \beta^{2n}$, $N < \beta^n$, $\mu \leftarrow -1/N \bmod \beta^{n/2}$, $N' = \mu N$
Output: $0 \le R < \beta^n$ such that $R = C/\beta^n \bmod N$
1: $q_0 \leftarrow C \bmod \beta^{n/2}$
2: $C \leftarrow (C + q_0 N')/\beta^{n/2}$
3: $q_1 \leftarrow \mu C \bmod \beta^{n/2}$
4: $R \leftarrow (C + q_1 N)/\beta^{n/2}$
5: if $R \ge \beta^n$ then return $R - N$ else return $R$.

The cost of the algorithm is $M(n, n/2)$ to compute $q_0 N'$ (even if $N'$ has in
principle $3n/2$ words, we know $N' = H\beta^{n/2} - 1$ with $H < \beta^n$, thus it suffices
to multiply $q_0$ by $H$), $M(n/2)$ to compute $\mu C \bmod \beta^{n/2}$, and again $M(n, n/2)$
to compute $q_1 N$, thus a total of $5M(n/2)$ if each $n \times (n/2)$ product is realized
by two $(n/2) \times (n/2)$ products.

The algorithm is quite similar to the one described at the end of §1.4.6,
where the cost was $3M(n/2) + D(n/2)$ for a division of $2n$ by $n$ with
remainder only. The main difference here is that, thanks to Montgomery's
form, the last classical division $D(n/2)$ in Svoboda's algorithm is replaced
by multiplications of total cost $2M(n/2)$, which is usually faster.

Algorithm MontgomerySvoboda2 can be extended as follows. The
value $C$ obtained after step 2 has $3n/2$ words, i.e., an excess of $n/2$ words.
Instead of reducing that excess with REDC, one could reduce it using
Svoboda's technique with $\mu' = -1/N \bmod \beta^{n/4}$, and $N'' = \mu' N$. This
would reduce the low $n/4$ words from $C$ at the cost of $M(n, n/4)$, and a
last REDC step would reduce the final excess of $n/4$, which would give
$D(2n, n) = M(n, n/2) + M(n, n/4) + M(n/4) + M(n, n/4)$. This "folding"
process can be generalized to $D(2n, n) = M(n, n/2) + \cdots + M(n, n/2^k) +
M(n/2^k) + M(n, n/2^k)$. If $M(n, n/2^k)$ reduces to $2^k M(n/2^k)$, this gives:

$$D(n) = 2M(n/2) + 4M(n/4) + \cdots + 2^{k-1}M(n/2^{k-1}) + (2^{k+1} + 1)M(n/2^k).$$

Unfortunately, the resulting multiplications are more and more unbalanced;
moreover one needs to store $k$ precomputed multiples $N', N'', \ldots$ of $N$, each
requiring at least $n$ words (if one does not store the low part of those constants,
which all equal $-1$). Fig. 2.2 shows that the single-folding algorithm is the
best one.

Algorithm     Karatsuba    Toom-Cook 3-way    Toom-Cook 4-way
D(n)          2M(n)        2.63M(n)           3.10M(n)
1-folding     1.67M(n)     1.81M(n)           1.89M(n)
2-folding     1.67M(n)     1.91M(n)           2.04M(n)
3-folding     1.74M(n)     2.06M(n)           2.25M(n)

Figure 2.2: Theoretical complexity of subquadratic REDC with 1-, 2- and
3-folding, for different multiplication algorithms.

Exercise 2.3 discusses further possible improvements in Montgomery-
Svoboda's algorithm, which achieve $D(n) = 1.58M(n)$ in the case of
Karatsuba multiplication.

2.3.3 McLaughlin's Algorithm 

McLaughlin's algorithm assumes one can perform fast multiplication modulo
both $2^n - 1$ and $2^n + 1$, for sufficiently many values of $n$. This assumption is
true for example with Schönhage-Strassen's algorithm: the original version
multiplies two numbers modulo $2^n + 1$, but discarding the "twist" operations
before and after the Fourier transforms computes their product modulo $2^n - 1$.
(This has to be done at the top level only: the recursive operations compute
modulo $2^{n'} + 1$ in both cases.)

The key idea in McLaughlin's algorithm is to avoid the classical "multiply
and divide" method for modular multiplication. Instead, assuming $N$ is
prime to $2^n - 1$, it determines $AB/(2^n - 1) \bmod N$ with convolutions modulo
$2^n \pm 1$, which can be performed in an efficient way using FFT.

Theorem 2.3.3 Algorithm MultMcLaughlin computes $AB/(2^n - 1) \bmod N$
correctly, in $\approx 1.5M(n)$ operations, assuming multiplication modulo $2^n \pm 1$
costs $\frac{1}{2}M(n)$, or 3 Fourier transforms of size $n$.

Algorithm 31 MultMcLaughlin

Input: $A$, $B$ with $0 \le A, B < N < 2^n$, $\mu = -N^{-1} \bmod (2^n - 1)$
Output: $AB/(2^n - 1) \bmod N$
1: $m \leftarrow AB\mu \bmod (2^n - 1)$
2: $S \leftarrow (AB + mN) \bmod (2^n + 1)$
3: $w \leftarrow -S \bmod (2^n + 1)$
4: if $2 \mid w$, then $s \leftarrow w/2$ else $s \leftarrow (w + 2^n + 1)/2$
5: if $AB + mN \equiv s \bmod 2$ then $t \leftarrow s$ else $t \leftarrow s + 2^n + 1$
6: if $t < N$ then return $t$ else return $t - N$

Proof. Step 1 is similar to step 1 of Algorithm FastREDC, with $\beta$ replaced
by $2^n - 1$. It follows that $AB + mN \equiv 0 \bmod (2^n - 1)$, therefore we have
$AB + mN = k(2^n - 1)$ with $0 \le k < 2N$. Step 2 computes $S = -2k \bmod (2^n + 1)$,
then step 3 gives $w = 2k \bmod (2^n + 1)$, and $s = k \bmod (2^n + 1)$ in step 4. Now
since $0 \le k < 2^{n+1}$ the value $s$ does not uniquely determine $k$, whose missing
bit is determined from the least significant bit of $AB + mN$ (step 5).
Finally the last step reduces $t = k$ modulo $N$.

The cost of the algorithm is mainly that of the four convolutions $AB \bmod
(2^n \pm 1)$, $(AB)\mu \bmod (2^n - 1)$ and $mN \bmod (2^n + 1)$, which cost $2M(n)$
altogether. However in $(AB)\mu \bmod (2^n - 1)$ and $mN \bmod (2^n + 1)$, the operands
$\mu$ and $N$ are invariant, therefore their Fourier transforms can be precomputed,
which saves $\frac{1}{3}M(n)$. A further saving of $\frac{1}{6}M(n)$ is obtained since we
perform only one backward Fourier transform in step 2. This finally gives
$(2 - \frac{1}{3} - \frac{1}{6})M(n) = 1.5M(n)$. □

The 1.5M(n) cost of McLaughlin's algorithm is quite surprising, since 
it means that a modular multiplication can be performed faster than two 
multiplications. In other words, since a modular multiplication is basically a 
multiplication followed by a division, this means that the "division" can be 
performed for the cost of half a multiplication! 
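
To make the six steps of MultMcLaughlin tangible, here is a toy C version for $n \le 15$. It is a sketch under loud assumptions: the function name `mult_mclaughlin` is hypothetical, the convolutions modulo $2^n \pm 1$ are replaced by ordinary `%` reductions on 64-bit integers (so none of the FFT savings appear), and $\mu$ is found by exhaustive search. It only illustrates why the steps recover $k = (AB + mN)/(2^n - 1)$.

```c
#include <assert.h>
#include <stdint.h>

/* Toy MultMcLaughlin for n <= 15: returns A*B/(2^n - 1) mod N.
   Requires 0 <= A, B < N < 2^n and gcd(N, 2^n - 1) = 1. */
uint64_t mult_mclaughlin(uint64_t A, uint64_t B, uint64_t N, unsigned n)
{
    assert(n >= 2 && n <= 15);
    uint64_t minus = ((uint64_t)1 << n) - 1;   /* 2^n - 1 */
    uint64_t plus = ((uint64_t)1 << n) + 1;    /* 2^n + 1 */
    assert(A < N && B < N && N <= minus);

    /* mu = -1/N mod (2^n - 1), by exhaustive search (a precomputation). */
    uint64_t mu = 0;
    for (uint64_t c = 1; c < minus; c++)
        if ((c * N) % minus == minus - 1) { mu = c; break; }
    assert(mu != 0);

    uint64_t m = ((A * B) % minus * mu) % minus;         /* step 1 */
    uint64_t S = (A * B + m * N) % plus;                 /* step 2 */
    uint64_t w = (plus - S) % plus;                      /* step 3: -S mod (2^n + 1) */
    uint64_t s = (w % 2 == 0) ? w / 2 : (w + plus) / 2;  /* step 4 */
    uint64_t parity = (A * B + m * N) % 2;               /* low bit of AB + mN */
    uint64_t t = (parity == s % 2) ? s : s + plus;       /* step 5 */
    return (t < N) ? t : t - N;                          /* step 6 */
}
```

For instance, with $n = 8$, $N = 11$, $A = 7$, $B = 9$ one gets $\mu = 139$, $m = 87$, $AB + mN = 1020 = 4 \cdot 255$, and the function returns 4; indeed $4 \cdot 255 \equiv 63 \equiv 8 \bmod 11$.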

2.3.4 Special Moduli 

For special moduli $N$ faster algorithms may exist. The ideal case is $N = \beta^n \pm 1$.
This is precisely the kind of modulus used in the Schönhage-Strassen algorithm
based on the Fast Fourier Transform (FFT). In the FFT range, a multiplication
modulo $\beta^n \pm 1$ is used to perform the product of two integers of at most $n/2$
words, and a multiplication modulo $\beta^n \pm 1$ costs $M(n/2) \le \frac{1}{2}M(n)$.

However in most applications the modulus cannot be chosen, and there is
no reason for it to have a special form. We refer to §2.8 for further information
about special moduli.

An important exception is elliptic curve cryptography (ECC), where one
almost always uses a special modulus, for example pseudo-Mersenne primes
like $2^{192} - 2^{64} - 1$ or $2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$.

2.3.5 Fast Multiplication Over GF(2)[x]

The finite field GF(2) is quite important in cryptography and number theory.
An alternative notation for GF(2) is $\mathbb{F}_2$. Usually one considers polynomials
over GF(2), i.e., elements of GF(2)[x], also called binary polynomials:

$$A = \sum_{i=0}^{d-1} a_i x^i,$$

where $a_i \in$ GF(2) can be represented by a bit, either 0 or 1. The computer
representation of a binary polynomial of degree $d - 1$ is thus similar to that
of an integer of $d$ bits. It is natural to ask how fast one can multiply such
binary polynomials. Since multiplication over GF(2)[x] is essentially the same
as integer multiplication, but without the carries, multiplying two degree-$d$
binary polynomials should be faster than multiplying two $d$-bit integers. In
practice, this is not the case, the main reason being that modern processors
do not provide a multiply instruction over GF(2)[x] at the word level. We
describe here fast multiplication algorithms over GF(2)[x]: a base-case
algorithm, a Toom-Cook 3-way variant, and an algorithm due to Schönhage
using the Fast Fourier Transform.

The Base Case 

If the multiplication over GF(2)[x] at the word level is not provided by the
processor, one should implement it in software. Consider a 64-bit computer
for example. One needs a routine multiplying two 64-bit words (interpreted
as two binary polynomials of degree at most 63) and returning the product
as two 64-bit words, being the lower and upper part of the product. We show
here how to efficiently implement such a routine using word-level operations.

For an integer $i$, write $\pi(i)$ for the binary polynomial corresponding to $i$,
and for a polynomial $p$, write $\nu(p)$ for the integer corresponding to $p$. We have
for example $\pi(0) = 0$, $\pi(1) = 1$, $\pi(2) = x$, $\pi(3) = x + 1$, $\pi(4) = x^2$, and
$\nu(x^4 + 1) = 17$, ...

Algorithm 32 GF2MulBaseCase

Input: nonnegative integers $0 \le a, b < 2^w = \beta$ (word base).
Output: nonnegative integers $\ell$, $h$ with $\pi(a)\pi(b) = \pi(h)x^w + \pi(\ell)$.
Require: a slice size parameter $k$, and $n = \lceil w/k \rceil$.
1: Compute $A_s = \nu(\pi(a)\pi(s) \bmod x^w)$ for all $0 \le s < 2^k$.
2: Write $b$ in base $2^k$: $b = b_0 + b_1 2^k + \cdots + b_{n-1} 2^{(n-1)k}$ where $0 \le b_j < 2^k$.
3: $h \leftarrow 0$, $\ell \leftarrow A_{b_{n-1}}$
4: for $i$ from $n - 2$ downto 0 do
5:     $h2^w + \ell \leftarrow (h2^w + \ell) \ll k$
6:     $\ell \leftarrow \ell \oplus A_{b_i}$
7: $u = b$, $m \leftarrow (2^{k-1} - 1)\frac{2^{nk} - 1}{2^k - 1} \bmod 2^w$
8: for $j$ from 1 to $k - 1$ do    ▷ repair step
9:     $u \leftarrow (u \gg 1) \wedge m$
10:    if bit $w - j$ of $a$ is set, then $h \leftarrow h \oplus u$

(Line 5 means that $h2^w + \ell$ is shifted by $k$ bits to the left, the low $w$ bits
are stored in $\ell$, and the upper bits in $h$.)

Consider for example the multiplication of $a = (110010110)_2$, which
represents the polynomial $x^8 + x^7 + x^4 + x^2 + x$, by $b = (100000101)_2$,
which represents the polynomial $x^8 + x^2 + 1$, with $w = 9$ and $k = 3$.
Step 1 will compute $A_0 = 0$, $A_1 = a = (110010110)_2$, $A_2 = xa \bmod x^9 =
x^8 + x^5 + x^3 + x^2$, $A_3 = (x + 1)a \bmod x^9 = x^7 + x^5 + x^4 + x^3 + x$,
$A_4 = x^6 + x^4 + x^3$, $A_5 = x^8 + x^7 + x^6 + x^3 + x^2 + x$,
$A_6 = x^8 + x^6 + x^5 + x^4 + x^2$, $A_7 = x^7 + x^6 + x^5 + x$.
The decomposition of $b$ in step 2 gives $b_0 = 5$, $b_1 = 0$, $b_2 = 4$. We thus
have $n = 3$, $h = (000000000)_2$, $\ell = A_4 = (001011000)_2$. After step 5 for
$i = 1$, we have $h = (000000001)_2$, $\ell = (011000000)_2$, and after step 6 $\ell$ is
unchanged since $b_1 = 0$ and $A_0 = 0$. Now for $i = 0$, after step 5 we have $h =
(000001011)_2$, $\ell = (000000000)_2$, and after step 6 $\ell = A_5 = (111001110)_2$.

Steps 7 to 10 form a "repair" loop. This is necessary because some upper
bits from $a$ are discarded during creation of the table $A$. More precisely,
bit $w - j$ of $a$ is discarded in $A_s$ whenever $w - j + \deg(\pi(s)) \ge w$, i.e.,
$\deg(\pi(s)) \ge j$; in the above example the most significant bit of $a$ is discarded
in $A_2$ to $A_7$, and the second most significant bit is discarded in $A_4$ to $A_7$.
Thus if bit $w - j$ of $a$ is set, for each monomial $x^t$ in some $\pi(s)$ with $t \ge j$, or
equivalently for each set bit $2^t$ in some $b_i$, one should add the corresponding
bit product. This can be performed in parallel for all concerned bits of $b$ using
a binary mask. Step 7 computes $u = (100000101)_2$, $m = (011011011)_2$. (The
mask $m$ prevents bits from a given set of $k$ consecutive bits from carrying
over to the next set.) Then for $j = 1$ in step 9 we have $u = (010000010)_2$,
and since bit 8 of $a$ was set, $h$ becomes $h \oplus u = (010001001)_2$. For $j = 2$
we have $u = (001000001)_2$, and since bit 7 of $a$ is also set, $h$ becomes
$h \oplus u = (011001000)_2$. This corresponds to the correct product

$$\pi(a)\pi(b) = x^{16} + x^{15} + x^{12} + x^8 + x^7 + x^6 + x^3 + x^2 + x.$$

The complication of repair steps could be avoided by the use of double-
length registers, but this would be significantly slower.
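
The base-case algorithm can be written compactly in C for $w = 64$. The sketch below assumes a slice size $k = 4$ (so $n = 16$ slices and a 16-entry table); the type `gf2_product` and the function name `gf2_mul_word` are hypothetical names chosen for this illustration, and no attempt is made at tuning.

```c
#include <assert.h>
#include <stdint.h>

/* High and low words of a carry-less product:
   pi(a) * pi(b) = pi(hi) * x^64 + pi(lo). */
typedef struct {
    uint64_t hi, lo;
} gf2_product;

gf2_product gf2_mul_word(uint64_t a, uint64_t b)
{
    enum { W = 64, K = 4 };
    uint64_t table[1 << K];

    /* Step 1: table[s] = low W bits of pi(a) * pi(s). */
    table[0] = 0;
    table[1] = a;
    for (unsigned s = 2; s < (1u << K); s += 2) {
        table[s] = table[s / 2] << 1;   /* multiply by x, top bit is dropped */
        table[s + 1] = table[s] ^ a;    /* add pi(a) */
    }

    /* Steps 3 to 6: Horner evaluation over the base-2^K digits of b. */
    uint64_t h = 0, l = table[b >> (W - K)];
    for (int i = W - 2 * K; i >= 0; i -= K) {
        h = (h << K) | (l >> (W - K));
        l = (l << K) ^ table[(b >> i) & ((1u << K) - 1)];
    }

    /* Steps 7 to 10: repair the bits of a that were dropped in the table. */
    uint64_t m = 0x7777777777777777ULL; /* 2^(K-1) - 1 repeated in every slice */
    uint64_t u = b;
    for (unsigned j = 1; j < K; j++) {
        u = (u >> 1) & m;
        if ((a >> (W - j)) & 1)
            h ^= u;
    }

    gf2_product p = { h, l };
    return p;
}
```

On the operands of the text's example, `gf2_mul_word(406, 261)` yields `hi = 0` and `lo = 102862`, the integer whose set bits are 16, 15, 12, 8, 7, 6, 3, 2 and 1. The repair loop only matters when the top $k - 1$ bits of `a` are set, as in the product of $x^{63}$ by itself.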

Toom-Cook 3-Way

Karatsuba's algorithm requires only 3 evaluation points, for which one
usually chooses 0, 1, and $\infty$. Since Toom-Cook 3-way requires 5 evaluation
points, one may think at first glance that it is not possible to find so many
points in GF(2). The idea to overcome this difficulty is to use the
transcendental variable $x$, and combinations of it, as evaluation points. An
example is given in Algorithm GF2ToomCook3. With $C = c_0 + c_1 X +
c_2 X^2 + c_3 X^3 + c_4 X^4$, we have $v_1 = A(1)B(1) = C(1) = c_0 + c_1 + c_2 + c_3 + c_4$,
thus $w_1 = c_1 + c_2 + c_3$. Similarly, we have $v_x = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4$,
thus $w_x = c_1 + c_2 x + c_3 x^2$, and $w_{1/x} = c_1 x^2 + c_2 x + c_3$. Then $t = c_1 + c_3$,
which gives $c_2$ as $w_1 + t$; $(w_{1/x} + c_2 x + t)/(1 + x^2)$ gives $c_1$, and $t + c_1$ gives $c_3$.

Algorithm 33 GF2ToomCook3

Input: binary polynomials $A = \sum_{i=0}^{n-1} a_i x^i$, $B = \sum_{i=0}^{n-1} b_i x^i$
Output: $C = AB = c_0 + c_1 X + c_2 X^2 + c_3 X^3 + c_4 X^4$
Let $X = x^k$ where $k = \lceil n/3 \rceil$
Write $A = a_0 + a_1 X + a_2 X^2$, $B = b_0 + b_1 X + b_2 X^2$, $\deg(a_i), \deg(b_i) < k$
$c_0 \leftarrow a_0 b_0$
$c_4 \leftarrow a_2 b_2$
$v_1 \leftarrow (a_0 + a_1 + a_2)(b_0 + b_1 + b_2)$
$v_x \leftarrow (a_0 + a_1 x + a_2 x^2)(b_0 + b_1 x + b_2 x^2)$
$v_{1/x} \leftarrow (a_0 x^2 + a_1 x + a_2)(b_0 x^2 + b_1 x + b_2)$
$w_1 \leftarrow v_1 - c_0 - c_4$
$w_x \leftarrow (v_x - c_0 - c_4 x^4)/x$
$w_{1/x} \leftarrow (v_{1/x} - c_0 x^4 - c_4)/x$
$t \leftarrow (w_x + w_{1/x})/(1 + x^2)$
$c_2 \leftarrow w_1 + t$
$c_1 \leftarrow (w_{1/x} + c_2 x + t)/(1 + x^2)$
$c_3 \leftarrow t + c_1$

The exact divisions by $1 + x^2$ can be implemented efficiently by Hensel's
division (§1.4.8), since we know the remainder is zero. More precisely, if $T(x)$
has degree $< n$, Hensel's division yields:

$$T(x) = Q(x)(1 + x^2) + R(x)x^n, \qquad (2.3)$$

where $Q(x)$ of degree less than $n$ is defined uniquely by $Q(x) = T(x)/(1 +
x^2) \bmod x^n$, and $\deg R(x) < 2$. This division can be performed word-by-
word, from the least to the most significant word of $T(x)$ (as in Algorithm
ConstantDivide from §1.4.7).

Assume that the number $w$ of bits per word is a power of 2, say $w = 32$.
One uses the identity:

$$\frac{1}{1 + x^2} = (1 - x^2)(1 + x^4)(1 + x^8)(1 + x^{16}) \bmod x^{32}.$$

(Of course, the "$-$" sign in this identity can be replaced by a "$+$" sign when
the base field is GF(2), and we will do this from now on.) Hensel's division
of a word $t(x)$ of $T(x)$ by $1 + x^2$ can be written

$$t(x) = q(x)(1 + x^2) + r(x)x^w, \qquad (2.4)$$

where $\deg(q) < w$ and $\deg(r) < 2$; then $(1 + x^2)(1 + x^4)(1 + x^8)(1 + x^{16})
\cdot t(x) = q(x) \bmod x^w$. The remainder $r(x)$ is simply formed by the two
most significant bits of $q(x)$, since dividing (2.4) by $x^w$ yields $q(x) \,\mathrm{div}\, x^{w-2} =
r(x)$. This remainder is then added to the following word of $T(x)$, and the
last remainder is known to be zero.
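
For a single 32-bit word, the word-level Hensel division by $1 + x^2$ reduces to four shift-and-xor steps, one per factor of the identity. The two function names below are hypothetical; the sketch handles one word only and leaves out the propagation of the remainder into the next word of $T(x)$.

```c
#include <assert.h>
#include <stdint.h>

/* Quotient q of Hensel's division of the word t by 1 + x^2 over GF(2),
   with w = 32: q = t / (1 + x^2) mod x^32. */
uint32_t gf2_hensel_div_1px2(uint32_t t)
{
    uint32_t q = t;
    q ^= q << 2;    /* multiply by 1 + x^2  */
    q ^= q << 4;    /* multiply by 1 + x^4  */
    q ^= q << 8;    /* multiply by 1 + x^8  */
    q ^= q << 16;   /* multiply by 1 + x^16 */
    return q;
}

/* Remainder r of that division: the two most significant bits of q. */
uint32_t gf2_hensel_rem_1px2(uint32_t q)
{
    return q >> 30;
}
```

As a small check, the word 45 encodes $x^5 + x^3 + x^2 + 1 = (1 + x^2)(x^3 + 1)$, and the quotient returned is 9, i.e. $x^3 + 1$, with remainder 0. For any word, multiplying the quotient back by $1 + x^2$ (that is, `q ^ (q << 2)`) gives the original word modulo $x^{32}$.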

Using the Fast Fourier Transform 

The algorithm described in this section multiplies two polynomials of degree
$n$ from GF(2)[x] in time $O(n \log n \log\log n)$, using the Fast Fourier Transform
(FFT).

The normal FFT requires $2^k$-th roots of unity, so does not work in
GF(2)[x] where these roots of unity fail to exist. The solution is to replace
powers of 2 by powers of 3.

Assume we want to multiply two polynomials of GF(2)[x]$/(x^N + 1)$, where
$N = KM$, and $K$ is a power of 3. Write the polynomials $a$, $b$ (of degree less
than $N$) that are to be multiplied as follows:

$$a = \sum_{i=0}^{K-1} a_i x^{iM}, \qquad b = \sum_{i=0}^{K-1} b_i x^{iM},$$

where the $a_i$, $b_i$ are polynomials of degree less than $M$.

Consider the $a_i$ as elements of $R_{N'} = $ GF(2)[x]$/(x^{2N'} + x^{N'} + 1)$, for an
integer $N' = KM'/3$. Let $\omega = x^{M'} \in R_{N'}$. Clearly we have
$1 + \omega^{K/3} + \omega^{2K/3} = 0$, and $\omega^K = 1$ in $R_{N'}$.

The Fourier transform of $(a_0, a_1, \ldots, a_{K-1})$ is $(\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_{K-1})$ where

$$\hat{a}_i = \sum_{j=0}^{K-1} \omega^{ij} a_j,$$

and similarly for $(b_0, b_1, \ldots, b_{K-1})$. The pointwise product of $(\hat{a}_0, \ldots, \hat{a}_{K-1})$
by $(\hat{b}_0, \ldots, \hat{b}_{K-1})$ gives $(\hat{c}_0, \ldots, \hat{c}_{K-1})$ where

$$\hat{c}_i = \left(\sum_{j=0}^{K-1} \omega^{ij} a_j\right) \left(\sum_{k=0}^{K-1} \omega^{ik} b_k\right).$$

Now the inverse transform of $(\hat{c}_0, \ldots, \hat{c}_{K-1})$ is defined as $(c_0, \ldots, c_{K-1})$
where:

$$c_\ell = \sum_{i=0}^{K-1} \omega^{-\ell i} \hat{c}_i.$$

(Note that since $\omega^K = 1$, we have $\omega^{-\lambda} = \omega^{-\lambda \bmod K}$, or $\omega^{-1} = \omega^{K-1} =
x^{M'(K-1)}$.) This gives:

$$c_\ell = \sum_{i=0}^{K-1} \omega^{-\ell i} \left(\sum_{j=0}^{K-1} \omega^{ij} a_j\right) \left(\sum_{k=0}^{K-1} \omega^{ik} b_k\right)
= \sum_{j=0}^{K-1} \sum_{k=0}^{K-1} a_j b_k \sum_{i=0}^{K-1} \omega^{i(j+k-\ell)}.$$

Write $t := j + k - \ell$. If $t \ne 0 \bmod K$, then $\sum_{i=0}^{K-1} \omega^{it} = \frac{\omega^{Kt} + 1}{\omega^t + 1} = 0$ since
$\omega^K = 1$ but $\omega^t \ne 1$. If $t = 0 \bmod K$, then $\omega^{it} = 1$; since $-K < j + k - \ell < 2K$,
this happens only for $j + k - \ell \in \{0, K\}$. In that case the sum $\sum_{i=0}^{K-1} \omega^{i(j+k-\ell)}$
equals $K$, i.e., 1 modulo 2 (remember $K$ is a power of 3).
We thus have for $0 \le \ell \le K - 1$:

$$c_\ell = \left[\sum_{j+k=\ell} a_j b_k + \sum_{j+k=\ell+K} a_j b_k\right] \bmod (x^{2N'} + x^{N'} + 1).$$

If $N' \ge M$, $c_\ell = \sum_{j+k=\ell} a_j b_k + \sum_{j+k=\ell+K} a_j b_k$, and the polynomial $c =
\sum_{\ell=0}^{K-1} c_\ell x^{\ell M}$ thus equals $ab \bmod (x^N + 1)$.

Like Schönhage-Strassen's integer multiplication, this algorithm can be
used recursively. Indeed, the pointwise products mod $x^{2N'} + x^{N'} + 1$ can
be performed mod $x^{3N'} + 1$, therefore a multiplication mod $x^N + 1$ reduces
to about $K$ products mod $x^{3N/K} + 1$, assuming $N' = M$. This yields the
asymptotic complexity of $O(N \log N \log\log N)$.

In practice the algorithm is usually used only once: a multiplication of
two degree-$N/2$ polynomials can be performed with $K$ products of degree
$\approx 2N/K$ polynomials.



Example: consider $N = 3 \cdot 5$, with $K = 3$, $M = 5$, $N' = 5$, $M' = 5$,

$$a = x^{14} + x^{13} + x^{12} + x^{11} + x^{10} + x^8 + x^6 + x^5 + x^4 + x^3 + x^2 + 1,$$

$$b = x^{13} + x^{11} + x^8 + x^7 + x^6 + x^2.$$

We have $(a_0, a_1, a_2) = (x^4 + x^3 + x^2 + 1, x^3 + x + 1, x^4 + x^3 + x^2 + x + 1)$,
$(b_0, b_1, b_2) = (x^2, x^3 + x^2 + x, x^3 + x)$. In the Fourier domain, we perform
computations modulo $x^{10} + x^5 + 1$, and $\omega = x^5$, which gives

$$(\hat{a}_0, \hat{a}_1, \hat{a}_2) = (x^3 + 1, x^9 + x^7 + x, x^9 + x^7 + x^4 + x^2 + x),$$

$$(\hat{b}_0, \hat{b}_1, \hat{b}_2) = (0, x^7 + x^3 + x^2 + x, x^7 + x^3 + x),$$

$$(\hat{c}_0, \hat{c}_1, \hat{c}_2) = (0, x^7 + x^6 + x^3, x^6 + x^3),$$

$$(c_0, c_1, c_2) = (x^7, x^6 + x^3 + x^2, x^7 + x^6 + x^3 + x^2),$$

and

$$c_0 + x^M c_1 + x^{2M} c_2 = x^{13} + x^{12} + x^{11} + x^8 + x^2 + x \bmod (x^{15} + 1).$$



We describe how the Fast Fourier Transform is performed. We want to
compute efficiently

$$\hat{a}_i = \sum_{j=0}^{K-1} \omega^{ij} a_j,$$

where $K = 3^k$. Define FFT$(\omega, (a_0, \ldots, a_{K-1})) = (\hat{a}_0, \ldots, \hat{a}_{K-1})$. For $0 \le \ell <
K/3$ we define

$$a' = \mathrm{FFT}(\omega^3, (a_0, a_3, \ldots, a_{K-3})),$$

$$a'' = \mathrm{FFT}(\omega^3, (a_1, a_4, \ldots, a_{K-2})),$$

$$a''' = \mathrm{FFT}(\omega^3, (a_2, a_5, \ldots, a_{K-1})),$$

i.e.,

$$a'_\ell = \sum_{j=0}^{K/3-1} \omega^{3\ell j} a_{3j}, \qquad a''_\ell = \sum_{j=0}^{K/3-1} \omega^{3\ell j} a_{3j+1}, \qquad a'''_\ell = \sum_{j=0}^{K/3-1} \omega^{3\ell j} a_{3j+2},$$

thus we have for $0 \le \ell < K/3$:

$$\hat{a}_\ell = a'_\ell + \omega^\ell a''_\ell + \omega^{2\ell} a'''_\ell,$$

$$\hat{a}_{\ell+K/3} = a'_\ell + \omega^{\ell+K/3} a''_\ell + \omega^{2\ell+2K/3} a'''_\ell,$$

$$\hat{a}_{\ell+2K/3} = a'_\ell + \omega^{\ell+2K/3} a''_\ell + \omega^{2\ell+4K/3} a'''_\ell.$$

This group of operations is the equivalent of the "butterfly" operation
$(a \leftarrow a + b, b \leftarrow a + \omega b)$ in the classical FFT.


2.4 Division and Inversion 

We have seen above that modular multiplication reduces to integer division,
since to compute $ab \bmod N$, the classical method consists in dividing $ab$ by
$N$ as $ab = qN + r$, then $ab \equiv r \bmod N$. In the same vein, modular division
reduces to an integer (extended) gcd. More precisely, the division $a/b \bmod N$
is usually computed as $a \cdot (1/b) \bmod N$, thus a modular inverse is followed
by a modular multiplication. We therefore concentrate on modular inversion
in this section.

We have seen in Chapter 1 that computing an extended gcd is expensive,
both for small sizes where it usually costs several multiplications, and for large
sizes where it costs $O(M(n) \log n)$. Therefore modular inversions should be
avoided if possible; we explain at the end of this section how this can be
done.

Algorithm 34 (ModularInverse) is just Algorithm ExtendedGcd (§1.6.2),
with $(a, b) \to (b, N)$ and the lines computing the cofactors of $N$ omitted.
Algorithm ModularInverse is the naive version of modular inversion,
with complexity $O(n^2)$ if $N$ takes $n$ words in base $\beta$. The subquadratic
$O(M(n) \log n)$ algorithm is based on the HalfGcd algorithm (§1.6.3).

Algorithm 34 ModularInverse
Input: integers $b$ and $N$, $b$ prime to $N$.
Output: integer $u = 1/b \bmod N$.
$(u, w) \leftarrow (1, 0)$, $c \leftarrow N$
while $c \ne 0$ do
    $(q, r) \leftarrow \mathrm{DivRem}(b, c)$
    $(b, c) \leftarrow (c, r)$
    $(u, w) \leftarrow (w, u - qw)$
return $u$.

When the modulus $N$ has a special form, faster algorithms may exist. In
particular for $N = p^k$, $O(M(n))$ algorithms exist, based on Hensel lifting,
which can be seen as the $p$-adic variant of Newton's method (§4.2). To
compute $1/b \bmod N$, we use a $p$-adic version of the iteration Eq. (4.5):

$$x_{j+1} = x_j + x_j(1 - bx_j) \bmod p^k. \qquad (2.5)$$

Assume $x_j$ approximates $1/b$ to "$p$-adic precision" $\ell$, i.e., $bx_j \equiv 1 + \varepsilon p^\ell$. Then
$bx_{j+1} = bx_j(2 - bx_j) = (1 + \varepsilon p^\ell)(1 - \varepsilon p^\ell) = 1 - \varepsilon^2 p^{2\ell}$. Therefore $x_{j+1}$ is an
approximation of $1/b$ to double precision (in the $p$-adic sense).

As an example, assume one wants to compute the inverse of an odd integer
$b$ modulo $2^{32}$. The initial approximation $x_0 = 1$ satisfies $x_0 = 1/b \bmod 2$, thus
five iterations are enough. The first iteration is $x_1 \leftarrow x_0 + x_0(1 - bx_0) \bmod 2^2$,
which simplifies to $x_1 \leftarrow 2 - b \bmod 4$ since $x_0 = 1$. Now whether $b \equiv 1 \bmod 4$
or $b \equiv 3 \bmod 4$, we have $2 - b \equiv b \bmod 4$, thus one can start directly the
second iteration with $x_1 = b$:

$x_2 \leftarrow b(2 - b^2) \bmod 2^4$
$x_3 \leftarrow x_2(2 - bx_2) \bmod 2^8$
$x_4 \leftarrow x_3(2 - bx_3) \bmod 2^{16}$
$x_5 \leftarrow x_4(2 - bx_4) \bmod 2^{32}$


Consider for example $b = 17$. The above algorithm yields $x_2 = 1$, $x_3 = 241$,
$x_4 = 61681$ and $x_5 = 4042322161$. Of course, any computation mod $p^\ell$ might
be computed modulo $p^k$ for $k \ge \ell$, in particular all above computations might
be performed modulo $2^{32}$. On a 32-bit computer, arithmetic on basic integer
types is usually performed modulo $2^{32}$, thus the reduction comes for free, and
one will usually write in the C language (using unsigned variables):

    x2 = b * (2 - b * b);
    x3 = x2 * (2 - b * x2);
    x4 = x3 * (2 - b * x3);
    x5 = x4 * (2 - b * x4);
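
The iteration can be wrapped into a small self-contained function. This is a minimal sketch assuming that `uint32_t` arithmetic wraps modulo $2^{32}$ (as the text does for 32-bit unsigned variables); the function name `inverse_mod_2_32` is a hypothetical choice.

```c
#include <assert.h>
#include <stdint.h>

/* Inverse of an odd b modulo 2^32 by Hensel lifting (Newton iteration).
   All arithmetic is done modulo 2^32 by unsigned wrap-around. */
uint32_t inverse_mod_2_32(uint32_t b)
{
    assert(b & 1);               /* b must be odd */
    uint32_t x = b;              /* x1 = b is correct modulo 4 */
    for (int i = 0; i < 4; i++)  /* x2, x3, x4, x5: 4, 8, 16, 32 correct bits */
        x = x * (2u - b * x);
    return x;
}
```

For $b = 17$ the function returns 4042322161, and indeed $17 \cdot 4042322161 = 2^{36} + 1 \equiv 1 \bmod 2^{32}$.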

Another way to perform modular division when the modulus has a special
form is Hensel's division (§1.4.8). For a modulus $N = \beta^n$, given two integers
$A$, $B$, we compute $Q$ and $R$ such that

$$A = QB + R\beta^n.$$

Therefore we have $A/B \equiv Q \bmod \beta^n$. While Montgomery's modular
multiplication only needs the remainder $R$ of Hensel's division, modular division
needs the quotient $Q$, thus Hensel's division plays a central role in modular
arithmetic modulo $\beta^n$.

2.4.1 Several Inversions at Once 

A modular inversion, which reduces to an extended gcd (§1.6.2), is usually
much more expensive than a multiplication. This is true not only in the FFT
range, where a gcd takes time $O(M(n) \log n)$, but also for smaller numbers.
When several inversions are to be performed modulo the same number, the
following algorithm is usually faster:

Algorithm 35 MultipleInversion

Input: $0 < x_1, \ldots, x_k < N$
Output: $y_1 = 1/x_1 \bmod N, \ldots, y_k = 1/x_k \bmod N$
1: $z_1 \leftarrow x_1$
2: for $i$ from 2 to $k$ do
3:     $z_i \leftarrow z_{i-1} x_i \bmod N$
4: $q \leftarrow 1/z_k \bmod N$
5: for $i$ from $k$ downto 2 do
6:     $y_i \leftarrow q z_{i-1} \bmod N$
7:     $q \leftarrow q x_i \bmod N$
8: $y_1 \leftarrow q$

Proof. We have $z_i \equiv x_1 x_2 \cdots x_i \bmod N$, thus at the beginning of step $i$
(line 5), $q \equiv (x_1 \cdots x_i)^{-1} \bmod N$, which indeed gives $y_i \equiv 1/x_i \bmod N$. □

This algorithm uses only one modular inversion, and $3(k - 1)$ modular
multiplications. Thus it is faster than $k$ inversions when a modular inversion is
more than three times as expensive as a product. Fig. 2.3 shows a recursive
variant of the algorithm, with the same number of modular multiplications:
one for each internal node when going up the (product) tree, and two for
each internal node when going down the (remainder) tree. This recursive
variant might be performed in parallel in $O(\log k)$ operations using $k$
processors, however the total memory usage is $O(k \log k)$ residues, instead of $O(k)$
for the linear algorithm MultipleInversion.

Figure 2.3: A recursive variant of Algorithm MultipleInversion. First
go up the tree, building $x_1 x_2 \bmod N$ from $x_1$ and $x_2$ in the left branch,
$x_3 x_4 \bmod N$ in the right branch, and $x_1 x_2 x_3 x_4 \bmod N$ at the root of the
tree. Then invert the root of the tree. Finally go down the tree, multiplying
$1/(x_1 x_2 x_3 x_4)$ by the stored value $x_3 x_4$ to get $1/(x_1 x_2)$, and so on.
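
The two algorithms of this section fit in a few lines of C for single-word operands. The sketch below is illustrative only: the names `mod_inverse` and `multiple_inversion` are hypothetical, signed 64-bit integers are used so the cofactor may go negative, and the modulus is assumed to be below $2^{31}$ so that products do not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* 1/b mod N by the extended Euclidean algorithm (cf. ModularInverse).
   Requires gcd(b, N) = 1 and 0 < b < N < 2^31. */
int64_t mod_inverse(int64_t b, int64_t N)
{
    int64_t u = 1, w = 0, c = N;
    while (c != 0) {
        int64_t q = b / c, r = b % c;
        b = c; c = r;
        int64_t next = u - q * w;
        u = w; w = next;
    }
    assert(b == 1);                /* b now holds the gcd */
    return ((u % N) + N) % N;      /* normalise into [0, N) */
}

/* Simultaneous inversion of x[0..k-1] modulo N (cf. MultipleInversion):
   one call to mod_inverse and 3(k-1) modular multiplications.
   z is caller-provided scratch space of k entries. */
void multiple_inversion(const int64_t *x, int64_t *y, int64_t *z,
                        int k, int64_t N)
{
    z[0] = x[0];
    for (int i = 1; i < k; i++)
        z[i] = z[i - 1] * x[i] % N;   /* prefix products */
    int64_t q = mod_inverse(z[k - 1], N);
    for (int i = k - 1; i >= 1; i--) {
        y[i] = q * z[i - 1] % N;      /* inverse of x[i] */
        q = q * x[i] % N;             /* q becomes the inverse of z[i - 1] */
    }
    y[0] = q;
}
```

For example, inverting 2, 3, 5, 7 modulo 101 computes the prefix products 2, 6, 30, 8, inverts 8 once (giving 38), and then unwinds to the inverses 51, 34, 81, 29.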

A dual case is when there are several moduli but the number to invert is
fixed. Say we want to compute $1/x \bmod N_1, \ldots, 1/x \bmod N_k$. We illustrate
a possible algorithm in the case $k = 4$. First compute $N = N_1 \cdots N_k$ using a
product tree like that in Fig. 2.3, for example first compute $N_1 N_2$ and $N_3 N_4$,
then multiply both to get $N = (N_1 N_2)(N_3 N_4)$. Then compute $y = 1/x \bmod
N$, and go down the tree, while reducing the residue at each node. In our
example we compute $z = y \bmod (N_1 N_2)$ in the left branch, then $z \bmod N_1$
yields $1/x \bmod N_1$. An important difference between this algorithm and the
algorithm illustrated in Fig. 2.3 is that here, the numbers grow while going
up the tree. Thus, depending on the sizes of $x$ and the $N_j$, this algorithm
might be of theoretical interest only.

2.5 Exponentiation 

Modular exponentiation is the most time-consuming mathematical operation
in several cryptographic algorithms. The well-known RSA algorithm is based
on the fact that computing

$$c = a^e \bmod N \qquad (2.6)$$

is relatively easy, but recovering $a$ from $c$, $e$ and $N$ is difficult, especially when
$N$ has at least two large prime factors. The discrete logarithm problem is
similar: here $c$, $a$ and $N$ are given, and one looks for $e$ satisfying Eq. (2.6).

When the exponent $e$ is fixed (or known to be small), an optimal sequence
of squarings and multiplications might be computed in advance. This is
related to the classical addition chain problem: What is the smallest chain
of additions to reach the integer $e$, starting from 1? For example if $e = 15$,
a possible chain is:

$$1, \; 1 + 1 = 2, \; 1 + 2 = 3, \; 1 + 3 = 4, \; 3 + 4 = 7, \; 7 + 7 = 14, \; 1 + 14 = 15.$$

The length of a chain is defined to be the number of additions needed to
compute it. Thus this chain has length 6 (not 7). An addition chain readily
translates to an exponentiation chain:

$$a, \; a \cdot a = a^2, \; a \cdot a^2 = a^3, \; a \cdot a^3 = a^4, \; a^3 \cdot a^4 = a^7, \; a^7 \cdot a^7 = a^{14}, \; a \cdot a^{14} = a^{15}.$$

A shorter chain for $e = 15$ is:

$$1, \; 1 + 1 = 2, \; 1 + 2 = 3, \; 2 + 3 = 5, \; 5 + 5 = 10, \; 5 + 10 = 15.$$

This chain is the shortest possible for $e = 15$, so we write $\sigma(15) = 5$, where
in general $\sigma(e)$ denotes the length of the shortest addition chain for $e$. In
the case where $e$ is small, and an addition chain of shortest length $\sigma(e)$
is known for $e$, computing $a^e \bmod N$ may be performed in $\sigma(e)$ modular
multiplications.
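
The shorter chain for $e = 15$ translates directly into five modular multiplications. The function name `pow15_mod` is hypothetical and the modulus is assumed to be below $2^{32}$ so that each product fits in a 64-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* a^15 mod N with five modular multiplications, following the addition
   chain 1, 2, 3, 5, 10, 15. Requires 0 < N < 2^32. */
uint64_t pow15_mod(uint64_t a, uint64_t N)
{
    assert(N > 0 && N <= 0xFFFFFFFFULL);
    a %= N;
    uint64_t a2 = a * a % N;      /* 1 + 1 = 2   */
    uint64_t a3 = a * a2 % N;     /* 1 + 2 = 3   */
    uint64_t a5 = a2 * a3 % N;    /* 2 + 3 = 5   */
    uint64_t a10 = a5 * a5 % N;   /* 5 + 5 = 10  */
    return a5 * a10 % N;          /* 5 + 10 = 15 */
}
```

For instance $3^{15} = 14348907$, so `pow15_mod(3, 1000)` gives 907.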

When $e$ is large and $(a, N) = 1$, then $e$ might be reduced modulo $\phi(N)$,
where $\phi(N)$ is Euler's totient function, i.e., the number of integers in $[1, N]$
which are relatively prime to $N$. This is because $a^{\phi(N)} \equiv 1 \bmod N$ whenever
$(a, N) = 1$ (Euler's theorem, which generalizes Fermat's little theorem).

Since $\phi(N)$ is a multiplicative function, it is easy to compute $\phi(N)$ if we
know the prime factorisation of $N$. For example,

$$\phi(1001) = \phi(7 \cdot 11 \cdot 13) = (7 - 1)(11 - 1)(13 - 1) = 720,$$

and $2009 \equiv 569 \bmod 720$, so $17^{2009} \equiv 17^{569} \bmod 1001$.

Assume now that $e$ is smaller than $\phi(N)$. Since a lower bound on the
length $\sigma(e)$ of the addition chain for $e$ is $\lg e$, this yields a lower bound
$(\lg e)M(n)$ for the modular exponentiation, where $n$ is the size of $N$. When
$e$ is of size $k$, a modular exponentiation costs $O(kM(n))$. For $k = n$, the cost
$O(nM(n))$ of modular exponentiation is much more than the cost of
operations considered in Chapter 1, with $O(M(n) \log n)$ for the more expensive
ones there. The different algorithms presented in this section save only a
constant factor compared to binary exponentiation.

2.5.1 Binary Exponentiation 

A simple (and not far from optimal) algorithm for modular exponentiation is
"binary exponentiation". Two variants exist: left-to-right and right-to-left.
We give the former in Algorithm LeftToRightBinaryExp and leave the
latter as an exercise for the reader.

Algorithm 36 LeftToRightBinaryExp
Input: a, e, N positive integers
Output: $x = a^e \bmod N$
1: Let $(e_\ell e_{\ell-1} \ldots e_1 e_0)$ be the binary representation of e, with $e_\ell = 1$
2: $x \leftarrow a$
3: for $i = \ell - 1$ downto 0 do
4:     $x \leftarrow x^2 \bmod N$
5:     if $e_i = 1$ then $x \leftarrow ax \bmod N$

Left-to-right binary exponentiation has two advantages over right-to-left
exponentiation:

• it requires only one auxiliary variable, instead of two for the
right-to-left exponentiation: one to store successive values of $a^{2^i}$, and one to
store the result;

• in the case where a is small, the multiplications ax at step 5 always
involve a small operand.

If e is a random integer of n bits, step 5 will be performed on average
n/2 times, thus the average cost of Algorithm LeftToRightBinaryExp
is $\frac{3}{2} n M(n)$.

Example: for the exponent e = 3499211612, which is

$(11010000100100011011101101011100)_2$

in binary, Algorithm LeftToRightBinaryExp performs 31 squarings and
15 multiplications (one for each 1-bit, except the most significant one).
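A direct Python transcription of the left-to-right method, instrumented to count operations, reproduces these figures. This is a sketch: the function name and the returned counters are our own additions.

```python
def left_to_right_binary_exp(a, e, N):
    """Return (a**e mod N, squarings, multiplications) for e >= 1."""
    bits = bin(e)[2:]            # most significant bit first; bits[0] == '1'
    x = a % N
    squarings = multiplications = 0
    for bit in bits[1:]:
        x = x * x % N            # one squaring per remaining bit
        squarings += 1
        if bit == '1':
            x = a * x % N        # one multiplication per 1-bit
            multiplications += 1
    return x, squarings, multiplications


result = left_to_right_binary_exp(17, 3499211612, 1001)
```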

2.5.2 Base $2^k$ Exponentiation

Compared to binary exponentiation, base $2^k$ exponentiation reduces the
number of multiplications $ax \bmod N$ (Algorithm LeftToRightBinaryExp,
step 5). The idea is to precompute small powers of a mod N:

Algorithm 37 BaseKExp
Input: a, e, N positive integers
Output: $x = a^e \bmod N$
1: Precompute $a^2$, then $t[i] := a^i \bmod N$ for $1 \le i < 2^k$
2: Let $(e_\ell e_{\ell-1} \ldots e_1 e_0)$ be the base $2^k$ representation of e
3: $x \leftarrow t[e_\ell]$
4: for $i = \ell - 1$ downto 0 do
5:     $x \leftarrow x^{2^k} \bmod N$
6:     if $e_i \ne 0$ then $x \leftarrow t[e_i] x \bmod N$

The precomputation cost is $(2^k - 2) M(n)$, and if the digits $e_i$ are random and
uniformly distributed in $[0, 2^k - 1]$, then step 6 is performed with probability
$1 - 2^{-k}$. If e has n bits, the number of loops is about n/k. Ignoring the squares
at step 5, whose total cost depends on $k\ell \approx n$ (independent of k), the total
expected cost in terms of multiplications modulo N is:

$$2^k - 2 + \frac{n}{k} (1 - 2^{-k}).$$

For k = 1 this formula gives n/2; for k = 2 it gives $\frac{3}{8} n + 2$, which is faster for
n > 16; for k = 3 it gives $\frac{7}{24} n + 6$, which is faster than the k = 2 formula for
n > 48. When n is large, the optimal value of k is when $k^2 2^k \approx n / \log 2$. A
disadvantage of this algorithm is its memory usage, since $O(2^k)$ precomputed
entries have to be stored.
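The crossover points quoted above can be checked with exact rational arithmetic. The helper names `expected_mults` and `best_k` below are hypothetical, and the search range for k is an arbitrary assumption.

```python
from fractions import Fraction


def expected_mults(n, k):
    """Expected multiplications 2^k - 2 + (n/k)(1 - 2^-k); squarings ignored."""
    return 2**k - 2 + Fraction(n, k) * (1 - Fraction(1, 2**k))


def best_k(n, k_max=30):
    """Window size in 1..k_max minimising the expected multiplication count."""
    return min(range(1, k_max + 1), key=lambda k: expected_mults(n, k))
```

For instance, `expected_mults(16, 1)` and `expected_mults(16, 2)` coincide, which is why k = 2 only wins for n > 16.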

Example: consider the exponent e = 3499211612. Algorithm BaseKExp
performs about 31 squarings, essentially independently of k, thus we count
multiplications only. For k = 2, we have $e = (3100210123231130)_4$, and
Algorithm BaseKExp performs two multiplications to precompute $a^2$ and $a^3$,
and 11 multiplications for the non-zero digits of e in base 4 (except the
leading one). For k = 3, we have $e = (32044335534)_8$, and the algorithm
performs 6 multiplications to precompute $a^2, a^3, \ldots, a^7$, and 9
multiplications in step 6.
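The multiplication counts of this example can be reproduced with a short Python sketch of the fixed-window method. The function name, the table layout and the counters are our own assumptions; the full table of $2^k - 2$ products is always built, whether or not every digit occurs.

```python
def base_k_exp(a, e, N, k):
    """Return (a**e mod N, table multiplications, loop multiplications).

    Assumes e >= 1 and N > 1 (illustrative sketch of base-2^k exponentiation).
    """
    mask = (1 << k) - 1
    digits = []                      # digits of e in base 2^k
    while e > 0:
        digits.append(e & mask)
        e >>= k
    digits.reverse()                 # most significant digit first
    t = [1, a % N]                   # t[i] = a^i mod N
    for i in range(2, 1 << k):
        t.append(t[-1] * a % N)      # one multiplication per new entry
    table_mults = (1 << k) - 2
    x = t[digits[0]]
    loop_mults = 0
    for d in digits[1:]:
        for _ in range(k):           # x <- x^(2^k) mod N by k squarings
            x = x * x % N
        if d != 0:
            x = t[d] * x % N
            loop_mults += 1
    return x, table_mults, loop_mults
```

With e = 3499211612 the loop performs 11 multiplications for k = 2 and 9 for k = 3.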

This example shows two facts. First, if some digits (here 6 and 7) do
not appear in the base-$2^k$ representation of e, we do not need to precompute
the corresponding powers of a. Second, when a digit is even, say $e_i = 2$,
instead of doing 3 squarings and multiplying by $a^2$, one could do 2 squarings,
multiply by a, and perform a last squaring. This leads to the following
algorithm:

Algorithm 38 BaseKExpOdd
Input: a, e, N positive integers
Output: $x = a^e \bmod N$
1: Precompute $t[i] := a^i \bmod N$ for i odd, $1 \le i < 2^k$
2: Let $(e_\ell e_{\ell-1} \ldots e_1 e_0)$ be the base $2^k$ representation of e
3: $x \leftarrow t[e_\ell]$
4: for $i = \ell - 1$ downto 0 do
5:     write $e_i = 2^m d$ with d odd (if $e_i = 0$ then m = d = 0)
6:     $x \leftarrow x^{2^{k-m}} \bmod N$
7:     if $e_i \ne 0$ then $x \leftarrow t[d] x \bmod N$
8:     $x \leftarrow x^{2^m} \bmod N$

The correctness of steps 6-8 follows from:

$$x^{2^k} a^{2^m d} = (x^{2^{k-m}} a^d)^{2^m}.$$

On our example, with k = 3, this algorithm performs only 4 multiplications
to precompute $a^2$ then $a^3, a^5, a^7$, and 9 multiplications in step 7.
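A Python sketch of the odd-powers variant follows. Names and counters are our own. One point is an assumption on our part: the leading digit is also split into a power of two times an odd part, so that an even leading digit does not require a table entry that was never computed.

```python
def base_k_exp_odd(a, e, N, k):
    """Return (a**e mod N, table multiplications, loop multiplications), e >= 1."""
    mask = (1 << k) - 1
    digits = []
    while e > 0:
        digits.append(e & mask)
        e >>= k
    digits.reverse()

    a %= N
    a2 = a * a % N                       # a^2, used only to step between odd powers
    t = {1: a}
    table_mults = 0 if k == 1 else 1     # the product giving a^2
    for i in range(3, 1 << k, 2):
        t[i] = t[i - 2] * a2 % N
        table_mults += 1

    def split(d):
        """Write d = 2^m * odd; returns (m, odd), with (0, 0) for d = 0."""
        m = 0
        while d and d % 2 == 0:
            d //= 2
            m += 1
        return m, d

    m, d = split(digits[0])              # assumption: leading digit handled the same way
    x = t[d]
    for _ in range(m):
        x = x * x % N
    loop_mults = 0
    for digit in digits[1:]:
        m, d = split(digit)
        for _ in range(k - m):           # x <- x^(2^(k-m))
            x = x * x % N
        if d:
            x = t[d] * x % N
            loop_mults += 1
        for _ in range(m):               # x <- x^(2^m)
            x = x * x % N
    return x, table_mults, loop_mults
```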

2.5.3 Sliding Window and Redundant Representation 

The "sliding window" algorithm is a straightforward generalization of
Algorithm BaseKExpOdd. Instead of cutting the exponent into fixed parts of k
bits each, the idea is to divide it into windows, where two adjacent windows
might be separated by a block of zero or more 0-bits. The decomposition
starts from the least significant bits, for example with e = 3499211612, in
binary:

    1    101  00   001  001  00   011  011  101  101  0    111  00
    e_8  e_7       e_6  e_5       e_4  e_3  e_2  e_1       e_0

Here there are 9 windows (indicated by $e_8, \ldots, e_0$ above) and we perform
only 8 multiplications, an improvement of one multiplication over Algorithm
BaseKExpOdd. On average, the sliding window algorithm leads to about
$\lceil n/(k+1) \rceil$ windows instead of $\lceil n/k \rceil$ with (fixed windows and)
base-$2^k$ exponentiation.
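The window decomposition and the resulting exponentiation can be sketched in Python as follows. Function names are hypothetical, and the table of odd powers is built in full, as an assumption made for simplicity.

```python
def sliding_windows(e, k):
    """Split e into odd windows of at most k bits, scanning from the least
    significant bit. Returns (value, shift) pairs, most significant first,
    with e == sum(value << shift)."""
    windows = []
    shift = 0
    while e > 0:
        if e & 1 == 0:                        # a 0-bit between two windows
            e >>= 1
            shift += 1
        else:
            windows.append((e & ((1 << k) - 1), shift))
            e >>= k
            shift += k
    windows.reverse()
    return windows


def sliding_window_exp(a, e, N, k):
    """Return (a**e mod N, multiplications in the main loop) for e >= 1."""
    a %= N
    a2 = a * a % N
    t = {1: a}
    for i in range(3, 1 << k, 2):             # odd powers a^3, a^5, ...
        t[i] = t[i - 2] * a2 % N
    windows = sliding_windows(e, k)
    value, shift = windows[0]
    x = t[value]
    mults = 0
    for v, s in windows[1:]:
        for _ in range(shift - s):            # squarings down to the next window
            x = x * x % N
        x = t[v] * x % N
        mults += 1
        shift = s
    for _ in range(shift):                    # trailing 0-bits
        x = x * x % N
    return x, mults
```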

Another possible improvement may be feasible when division is possible
(and cheap) in the underlying group. For example, if we encounter three
consecutive ones, say 111, in the binary representation of e, we may replace
some bits by −1, denoted by $\bar{1}$, as in $100\bar{1}$. We have thus replaced three
multiplications by one multiplication and one division, in other words
$x^7 = x^8 \cdot x^{-1}$. On our running example, this gives:

$$e = 11010000100100100\bar{1}000\bar{1}00\bar{1}0\bar{1}00\bar{1}00,$$

which has only 10 non-zero digits, apart from the leading one, instead of 15
with bits 0 and 1 only. The redundant representation with bits 0, 1 and
$\bar{1}$ is called the Booth representation. It is a special case of the Avizienis
signed-digit redundant representation. Signed-digit representations exist in
any base.
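As an illustration in the multiplicative group modulo N, where "division" means multiplication by a modular inverse, the sketch below evaluates a signed-digit recoding of e = 3499211612 and uses it for exponentiation. The letter `T` standing for the digit −1 and the function name are our own conventions. `pow(a, -1, N)` needs Python 3.8 or later and gcd(a, N) = 1.

```python
def signed_digit_exp(a, digits, N):
    """Left-to-right exponentiation with digits in {-1, 0, 1}, leading digit 1.

    Returns (result, number of multiplications or divisions by a)."""
    a %= N
    a_inv = pow(a, -1, N)            # inverse of a modulo N
    x = a
    ops = 0
    for d in digits[1:]:
        x = x * x % N
        if d == 1:
            x = x * a % N
            ops += 1
        elif d == -1:
            x = x * a_inv % N        # "division" by a
            ops += 1
    return x, ops


# signed-digit recoding of e = 3499211612, with 'T' denoting the digit -1
recoded = "11010000100100100T000T00T0T00T00"
digits = [{'0': 0, '1': 1, 'T': -1}[c] for c in recoded]

value = 0
for d in digits:                     # Horner evaluation in base 2
    value = 2 * value + d
```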

For simplicity we have not distinguished between the cost of multiplica- 
tion and the cost of squaring (when the two operands in the multiplication 
are known to be equal), but this distinction is significant in some applications 
(e.g., elliptic curve cryptography). Note that, when the underlying group
operation is denoted by addition rather than multiplication, as is usually the
case for groups defined over elliptic curves, then the discussion above applies
with "multiplication" replaced by "addition", "division" by "subtraction",
and "squaring" by "doubling".

2.6 Chinese Remainder Theorem 

In applications where integer or rational results are expected, it is often
worthwhile to use a "residue number system" (as in §2.1.3) and perform all
computations modulo several small primes. The final result is then
recovered via the Chinese Remainder Theorem (CRT). For such applications, it is
important to have fast conversion routines from integer to modular
representation, and vice versa.

The integer to modular conversion problem is the following: given an
integer x, and several prime moduli $m_i$, $1 \le i \le n$, how to efficiently compute
$x_i = x \bmod m_i$, for $1 \le i \le n$? This is exactly the remainder tree problem,
which is discussed in §2.4.1 (see also Ex. 1.28).

The converse CRT reconstruction problem is the following: given the
$x_i$, how to efficiently reconstruct the unique integer x, $0 \le x < M =
m_1 m_2 \cdots m_k$, such that $x \equiv x_i \bmod m_i$, for $1 \le i \le k$? It suffices to
solve this problem for k = 2, and use the solution recursively. Assume
that $x \equiv a_1 \bmod m_1$, and $x \equiv a_2 \bmod m_2$. Write $x = \lambda m_1 + \mu m_2$. Then
$\mu m_2 \equiv a_1 \bmod m_1$, and $\lambda m_1 \equiv a_2 \bmod m_2$. Assume we have computed an
extended gcd of $m_1$ and $m_2$, i.e., we know integers u, v such that
$u m_1 + v m_2 = 1$. We deduce $\mu \equiv v a_1 \bmod m_1$, and $\lambda \equiv u a_2 \bmod m_2$. Thus

$$x \leftarrow (u a_2 \bmod m_2) m_1 + (v a_1 \bmod m_1) m_2.$$

This gives $x < 2 m_1 m_2$. If $x \ge m_1 m_2$ then set $x \leftarrow x - m_1 m_2$ to ensure that
$0 \le x < m_1 m_2$.
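The two-moduli step translates directly into Python. In this sketch the function names are ours, and the general case is handled by folding in the moduli one at a time rather than by a balanced recursion; this is a simplification and not the fastest arrangement.

```python
def ext_gcd(a, b):
    """Return (g, u, v) with u*a + v*b == g == gcd(a, b)."""
    u0, v0, u1, v1 = 1, 0, 0, 1
    while b:
        q = a // b
        a, b = b, a - q * b
        u0, u1 = u1, u0 - q * u1
        v0, v1 = v1, v0 - q * v1
    return a, u0, v0


def crt2(a1, m1, a2, m2):
    """Unique x in [0, m1*m2) with x = a1 mod m1 and x = a2 mod m2."""
    g, u, v = ext_gcd(m1, m2)        # u*m1 + v*m2 = 1
    if g != 1:
        raise ValueError("moduli must be coprime")
    x = (u * a2 % m2) * m1 + (v * a1 % m1) * m2
    if x >= m1 * m2:
        x -= m1 * m2
    return x


def crt(residues, moduli):
    """Reconstruct x from its residues by repeated use of crt2."""
    x, M = residues[0] % moduli[0], moduli[0]
    for a, m in zip(residues[1:], moduli[1:]):
        x = crt2(x, M, a, m)
        M *= m
    return x
```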



2.7 Exercises 

Exercise 2.1 Show that, if a symmetric representation in [−N/2, N/2) is used
in Algorithm ModularAdd (§2.2), then the probability that we need to add or
subtract N is 1/4 if N is even, and $(1 - 1/N^2)/4$ if N is odd (assuming in both
cases that a and b are uniformly distributed).

Exercise 2.2 Modify Algorithm GF2MulBaseCase (§2.3.5, The Base Case) to
use a table of size $2^{\lceil k/2 \rceil}$ instead of $2^k$, with only a small slowdown.

Exercise 2.3 Write down the complexity of Montgomery-Svoboda's algorithm
(§2.3.2) for k steps. For k = 3, use relaxed Karatsuba multiplication [169] to save
one M(n/3) product.

Exercise 2.4 Assume you have an FFT algorithm computing products modulo
$2^n + 1$. Prove that, with some preconditioning, you can perform a division of a
2n-bit integer by an n-bit integer as fast as 1.5 multiplications of n bits by n bits.

Exercise 2.5 Assume you know $p(x) \bmod (x^{n_1} + 1)$ and $p(x) \bmod (x^{n_2} + 1)$, where
$p(x) \in \mathrm{GF}(2)[x]$ has degree n − 1, and $n_1 > n_2$. Up to which value of n can you
uniquely reconstruct p? Design a corresponding algorithm.

Exercise 2.6 Analyze the complexity of the algorithm outlined at the end
of §2.4.1 to compute $1/x \bmod N_1, \ldots, 1/x \bmod N_k$, when all the $N_i$ have size
n, and x has size $\ell$. For which values of n, $\ell$ is it faster than the naive
algorithm which computes all modular inverses separately? [Assume M(n)
is quasi-linear, and neglect multiplication constants.]

Exercise 2.7 Write a RightToLeftBinaryExp algorithm and compare it
with Algorithm LeftToRightBinaryExp of §2.5.1.

Exercise 2.8 Analyze the complexity of the CRT reconstruction algorithm
outlined in §2.6 for $M = m_1 m_2 \cdots m_k$, with M having size n, and the $m_i$ of
size n/k each. [Assume $M(n) \approx n \log n$ for simplicity.]

2.8 Notes and References 

Several number theoretic algorithms make heavy use of modular arithmetic, in
particular integer factorization algorithms (for example: Pollard's $\rho$ algorithm and
the elliptic curve method).

Another important application of modular arithmetic in computer algebra is 
computing the roots of a univariate polynomial over a finite field, which requires 
efficient arithmetic over $\mathbb{F}_p[x]$.

We say in §2.1.3 that residue number systems can only be used when N factors
into $N_1 N_2 \ldots$; this is not quite true, since Bernstein shows in ^3] how to perform
modular arithmetic using a residue number system.

Some examples of efficient use of the Kronecker-Schonhage trick are given by 
Steel H5HJ- 

The basecase multiplication algorithm from (§2.3.5, The Base Case) (with
repair step) is a generalization of an algorithm published in NTL (Number Theory
Library) by Shoup. The Toom-Cook 3-way algorithm from §2.3.5 was designed
by the second author, after discussions with Michel Quercia, who proposed the
exact division by $1 + x^2$; an implementation in NTL has been available on the web
page of the second author since 2004. The Fast Fourier Transform multiplication
algorithm is due to Schönhage [149].

The original description of Montgomery's REDC algorithm is [130]. It is now
widely used in several applications. However only a few authors considered using
a reduction factor which is not of the form $\beta^n$, among them McLaughlin
and Mihailescu [127]. The folding optimization of REDC described in §2.3.2
(Sub-quadratic Montgomery Reduction) is an LSB-extension of the algorithm described
in the context of Barrett's algorithm by Hasenplaugh, Gaubatz and Gopal [95].

The description of McLaughlin's algorithm in §2.3.3 follows [123, Variation 2];
McLaughlin's algorithm was reformulated in a polynomial context by Mihailescu 

ma. 

Many authors have proposed FFT algorithms, or improvements of such
algorithms. Some references are Aho, Hopcroft and Ullman [2]; Borodin and Munro [26],
who describe the polynomial approach; Van Loan [119] for the linear algebra
approach; and Pollard [142] for the FFT over finite fields. In Bernstein [18, §23] the
reader will find some historical remarks and several nice applications of the FFT.

Recently Fürer [80] has proposed an integer multiplication algorithm that is
asymptotically faster than the Schönhage-Strassen algorithm. Fürer's algorithm
almost achieves the conjectured best possible $\Theta(n \log n)$ running time.

Concerning special moduli, Percival considers in [140] the case $N = a \pm b$
where both a and b are highly composite; this is a generalization of the case
$N = \beta^n \pm 1$. The pseudo-Mersenne primes from §2.3.4 are recommended by the
National Institute of Standards and Technology (NIST) |§5]. See also the book



The statement in §2.3.5 that modern processors do not provide a multiply
instruction over GF(2)[x] was correct when written in 2008, but the situation may
soon change, as Intel plans to introduce such an instruction (PCLMULQDQ) on its
"Westmere" chips scheduled for release in 2009.

The description in §2.3.5 of fast multiplication algorithms over GF(2)[x] is
based on the paper [31], which gives an extensive study of algorithms for fast
arithmetic over GF(2)[x]. For an application to factorisation of polynomials over
GF(2), see |5Q|.

The FFT algorithm described in §2.3.5 (Using the Fast Fourier Transform) is
due to Schönhage [149] and is implemented in the gf2x package [43]. Many FFT
algorithms and variants are described by Arndt in (Hj: Walsh transform, Haar
transform, Hartley transform, number theoretic transform (NTT).

Algorithm MultipleInversion, also known as "batch inversion", is due
to Montgomery [131].

Modular exponentiation algorithms are described in much detail in the
Handbook of Applied Cryptography by Menezes, van Oorschot and Vanstone [124,
Chapter 14]. A detailed description of the best theoretical algorithms, with due credit,
can be found in [T£]. When both the modulus and base are invariant,
modular exponentiation with k-bit exponent and n-bit modulus can be performed
in $O((k / \log k) M(n))$, after a precomputation of $O(k / \log k)$ powers in time $O(k M(n))$.
Take for example $b = 2^{k/t}$ in Note 14.112 and Algorithm 14.109 from [124], with
$t \log t \approx k$, where the powers $a^{b^i} \bmod N$ for $0 \le i < t$ are precomputed. An
algorithm of same complexity using a DBNS (Double-Based Number System) was
proposed by Dimitrov, Jullien and Miller [71], however with a larger table, namely
of $\Theta(k^2)$ precomputed powers.

A quadratic algorithm for CRT reconstruction is discussed in [BO]; Möller gives
some improvements in the case of a small number of small moduli known in advance
[129]. The explicit Chinese Remainder Theorem and its applications to modular
exponentiation are discussed by Bernstein and Sorenson in [2



Chapter 3 

Floating-Point Arithmetic 



This Chapter discusses the basic operations (addition,
subtraction, multiplication, division, square root, conversion) on
arbitrary precision floating-point numbers, as Chapter 1 does for
arbitrary precision integers. More advanced functions like
elementary and special functions are covered in Chapter 4. This
Chapter largely follows the IEEE 754 standard, and extends it
in a natural way to arbitrary precision; deviations from IEEE
754 are explicitly mentioned. Topics not discussed here include:
hardware implementations, fixed-precision implementations,
special representations.



3.1 Representation 

The classical non-redundant representation of a floating-point number x in
radix $\beta > 1$ is the following (other representations are discussed in

$$x = (-1)^s \cdot m \cdot \beta^e, \qquad (3.1)$$

where $(-1)^s$, $s \in \{0, 1\}$, is the sign, $m \ge 0$ is the significand, and the integer
e is the exponent of x. In addition, a positive integer n defines the precision
of x, which means that the significand m contains at most n significant digits
in radix $\beta$.

An important special case is m = 0, representing zero. In this case
the sign s and exponent e are irrelevant and may be used to encode other
information (see for example §3.1.3).




86 Modern Computer Arithmetic, version 0.3 of June 10, 2009 

For $m \ne 0$, several semantics are possible; the most common ones are:

• $\beta^{-1} \le m < 1$, then $\beta^{e-1} \le |x| < \beta^e$. In this case m is an integer
multiple of $\beta^{-n}$. We say that the unit in the last place of x is $\beta^{e-n}$, and
we write $\mathrm{ulp}(x) = \beta^{e-n}$. For example x = 3.1416 with radix $\beta = 10$ is
encoded by m = 0.31416 and e = 1. This is the convention we will use
in this Chapter;

• $1 \le m < \beta$, then $\beta^e \le |x| < \beta^{e+1}$, and $\mathrm{ulp}(x) = \beta^{e+1-n}$. With radix ten
the number x = 3.1416 is encoded by m = 3.1416 and e = 0. This is
the convention adopted in the IEEE 754 standard;

• we can also use an integer significand $\beta^{n-1} \le m < \beta^n$, then
$\beta^{e+n-1} \le |x| < \beta^{e+n}$, and $\mathrm{ulp}(x) = \beta^e$. With radix ten the number x = 3.1416 is
encoded by m = 31416 and e = −4.
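The three conventions above can be compared on the same number with exact rational arithmetic. In the sketch below the function name and the tuple layout are our own; the input is assumed positive and exactly representable with n digits in radix β.

```python
from fractions import Fraction


def encodings(x, beta, n):
    """Return the (m, e) pairs for x > 0 under the three significand conventions:
    1/beta <= m < 1,  1 <= m < beta,  and integer beta^(n-1) <= m < beta^n."""
    x = Fraction(x)
    b = Fraction(beta)
    e = 0
    while x >= b ** e:               # find e with beta^(e-1) <= x < beta^e
        e += 1
    while x < b ** (e - 1):
        e -= 1
    fractional = (x / b ** e, e)
    scientific = (x / b ** (e - 1), e - 1)
    integer = (x / b ** (e - n), e - n)
    return fractional, scientific, integer


frac, sci, integ = encodings("3.1416", 10, 5)
```

All three pairs denote the same value; only the position of the radix point, and hence the exponent, changes.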

Note that in the above three cases, there is only one possible representation
of a non-zero floating-point number: we have a canonical representation. In
some applications, it is useful to remove the lower bound on nonzero m,
which in the three cases above gives respectively $0 < m < 1$, $0 < m < \beta$, and
$0 < m < \beta^n$, with m an integer multiple of $\beta^{-n}$, $\beta^{1-n}$, and 1 respectively.
In this case, there is no longer a canonical representation. For example, with
an integer significand and a precision of 5 digits, the number 3.1400 might be
encoded by (m = 31400, e = −4), (m = 03140, e = −3), or (m = 00314, e = −2).
However, this non-canonical representation has the drawback
that the most significant non-zero digit of the significand is not known in
advance. The unique encoding with a non-zero most significant digit, i.e.,
(m = 31400, e = −4) here, is called the normalised (or simply normal)
encoding.

The significand is also called mantissa or fraction. The above examples
demonstrate that the different significand semantics correspond to different
positions of the decimal (or radix $\beta$) point, or equivalently to different biases
of the exponent. We assume in this Chapter that both the radix $\beta$ and the
significand semantics are implicit for a given implementation, thus are not
physically encoded.

3.1.1 Radix Choice 

Most floating-point implementations use radix $\beta = 2$ or a power of two,
because this is convenient and efficient on binary computers. For a radix $\beta$
which is not a power of 2, two choices are possible:

• store the significand in base $\beta$, or more generally in base $\beta^k$ for an
integer $k \ge 1$. Each digit in base $\beta^k$ requires $\lceil k \lg \beta \rceil$ bits. With such a
choice, individual digits can be accessed easily. With $\beta = 10$ and k = 1,
this is the "Binary Coded Decimal" or BCD encoding: each decimal
digit is represented by 4 bits, with a memory loss of about 17% (since
$\lg(10)/4 \approx 0.83$). A more compact choice is radix $10^3$, where 3 decimal
digits are stored in 10 bits, instead of in 12 bits with the BCD format.
This yields a memory loss of only 0.34% (since $\lg(1000)/10 \approx 0.9966$);

• store the significand in binary. This idea is used in Intel's
Binary-Integer Decimal (BID) encoding, and in one of the two decimal
encodings in the revision of the IEEE 754 standard. Individual digits
cannot be directly accessed, but one can use efficient binary hardware
or software to perform operations on the significand.

In arbitrary precision, a drawback of the binary encoding is that, during the
addition of two numbers, it is not easy to detect if the significand exceeds the
maximum value $\beta^n - 1$ (when considered as an integer) and thus if rounding is
required. Either $\beta^n$ is precomputed, which is only realistic if all computations
involve the same precision n, or it is computed on the fly, which might result
in $O(M(n) \log n)$ complexity (see Chapter 1).
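The storage figures quoted for BCD and for radix $10^3$ follow from a one-line computation, sketched here in Python; `storage_efficiency` is a hypothetical helper name.

```python
import math


def storage_efficiency(beta, k):
    """Fraction of the stored bits carrying information when one digit in
    radix beta^k is stored in ceil(k * lg(beta)) bits."""
    needed = k * math.log2(beta)
    return needed / math.ceil(needed)


bcd = storage_efficiency(10, 1)        # about 0.83, i.e. a loss of about 17%
radix_1000 = storage_efficiency(10, 3) # about 0.9966, i.e. a loss of about 0.34%
```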

3.1.2 Exponent Range 

In principle, one might consider an unbounded exponent. In other words,
the exponent e might be encoded by an arbitrary precision integer
(Chapter 1). This has the great advantage that no underflow or overflow would
occur (see below). However, in most applications, an exponent encoded in
32 bits is more than enough: this enables one to represent values up to about
$10^{646456993}$. A result exceeding this value most probably corresponds to an
error in the algorithm or the implementation. In addition, using arbitrary
precision integers for the exponent induces an extra overhead that slows down
the implementation in the average case, and it requires more memory to store
each number.

Thus, in practice the exponent usually has a limited range $e_{\min} \le e \le
e_{\max}$. We say that a floating-point number is representable if it can be
represented in the form $(-1)^s \cdot m \cdot \beta^e$ with $e_{\min} \le e \le e_{\max}$. The set of representable
numbers clearly depends on the significand semantics. For the convention we
use here, i.e., $\beta^{-1} \le m < 1$, the smallest positive floating-point number is
$\beta^{e_{\min} - 1}$, and the largest is $\beta^{e_{\max}} (1 - \beta^{-n})$.

Other conventions for the significand yield different exponent ranges. For
example the IEEE 754 double precision format (called binary64 in the
IEEE 754 revision) has $e_{\min} = -1022$, $e_{\max} = 1023$ for a significand in
[1, 2); this corresponds to $e_{\min} = -1021$, $e_{\max} = 1024$ for a significand in
[1/2, 1), and $e_{\min} = -1074$, $e_{\max} = 971$ for an integer significand in $[2^{52}, 2^{53})$.
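On a platform whose Python `float` is binary64 (an assumption, though the usual case), these three exponent ranges can be read off the smallest normal and largest finite numbers. `math.frexp` uses the [1/2, 1) convention; the dictionary keys below are our own labels.

```python
import math
import sys


def exponent_in_conventions(x):
    """Exponent of a normal binary64 number x under the three conventions."""
    _, e = math.frexp(x)             # x = m * 2^e with 1/2 <= m < 1
    return {"[1/2,1)": e, "[1,2)": e - 1, "integer": e - 53}


at_min = exponent_in_conventions(sys.float_info.min)   # smallest positive normal
at_max = exponent_in_conventions(sys.float_info.max)   # largest finite number
```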

3.1.3 Special Values 

With a bounded exponent range, if one wants a complete arithmetic, one 
needs some special values to represent very large and very small values. Very 
small values are naturally flushed to zero, which is a special number in the 
sense that its significand is m = 0, which is not normalised. For very large
values, it is natural to introduce two special values $-\infty$ and $+\infty$, which
encode large non-representable values. Since we have two infinities, it is natural
to have two zeros −0 and +0, for example $1/(-\infty) = -0$ and $1/(+\infty) = +0$.
This is the IEEE 754 choice. Another possibility would be to have only one
infinity $\infty$ and one zero 0, forgetting the sign in both cases.

An additional special value is Not a Number (NaN), which either
represents an uninitialised value, or is the result of an invalid operation like $\sqrt{-1}$
or $(+\infty) - (+\infty)$. Some implementations distinguish between different kinds
of NaN, in particular IEEE 754 defines signalling and quiet NaNs.

3.1.4 Subnormal Numbers 

Subnormal numbers are required by the IEEE 754 standard, to allow what is 
called gradual underflow between the smallest (in absolute value) non-zero 
normalised numbers and zero. We first explain what subnormal numbers are; 
then we will see why they are not necessary in arbitrary precision. 

Assume we have an integer significand in $[\beta^{n-1}, \beta^n)$ where n is the
precision, and an exponent in $[e_{\min}, e_{\max}]$. Write $\eta = \beta^{e_{\min}}$. The two smallest
positive normalised numbers are $x = \beta^{n-1} \eta$ and $y = (\beta^{n-1} + 1) \eta$. The
difference y − x equals $\eta$, which is tiny compared to the difference between
0 and x, which is $\beta^{n-1} \eta$. In particular, y − x cannot be represented exactly
as a normalised number, and will be rounded to zero in round-to-nearest
mode. This has the unfortunate consequence that instructions like:

    if (y <> x) then
        z = 1.0 / (y - x);

will produce a "division by zero" error within 1.0 / (y - x).

Subnormal numbers solve this problem. The idea is to relax the condition
$\beta^{n-1} \le m$ for the exponent $e_{\min}$. In other words, we include all numbers of
the form $m \cdot \beta^{e_{\min}}$ for $1 \le m < \beta^{n-1}$ in the set of valid floating-point numbers.
One could also permit m = 0, and then zero would be a subnormal number,
but we continue to regard zero as a special case.

Subnormal numbers are all integer multiples of $\eta$, with a multiplier
$1 \le m < \beta^{n-1}$. The difference between $x = \beta^{n-1} \eta$ and $y = (\beta^{n-1} + 1) \eta$ is now
representable, since it equals $\eta$, the smallest positive subnormal number.
More generally, all floating-point numbers are multiples of $\eta$, likewise for
their sum or difference (in other words, operations in the subnormal domain
correspond to fixed-point arithmetic). If the sum or difference is non-zero, it
has magnitude at least $\eta$, thus cannot be rounded to zero. Thus the "division
by zero" problem mentioned above does not occur with subnormal numbers.

In the IEEE 754 double precision format (called binary64 in the IEEE
754 revision) the smallest positive normal number is $2^{-1022}$, and the
smallest positive subnormal number is $2^{-1074}$. In arbitrary precision, subnormal
numbers seldom occur, since usually the exponent range is huge compared
to the expected exponents in a given application. Thus the only reason 
for implementing subnormal numbers in arbitrary precision is to provide an 
extension of IEEE 754 arithmetic. Of course, if the exponent range is un- 
bounded, then there is absolutely no need for subnormal numbers, because 
any nonzero floating-point number can be normalised. 
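Gradual underflow can be observed directly in hardware double precision. The Python sketch below assumes that `float` is IEEE 754 binary64 with subnormals enabled (no flush-to-zero mode); variable names are ours.

```python
import math
import sys

x = math.ldexp(1.0, -1022)               # smallest positive normal number
y = x * (1.0 + sys.float_info.epsilon)   # its successor, 2^-1022 + 2^-1074
eta = y - x                              # exact, and representable only as a subnormal

is_nonzero = (eta != 0.0)                # the guard "y <> x" and "y - x" agree
```

Without subnormal numbers, `eta` would have been rounded to zero although `y` and `x` differ.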

3.1.5 Encoding 

The encoding of a floating-point number $x = (-1)^s \cdot m \cdot \beta^e$ is the way the
values s, m and e are stored in the computer. Remember that $\beta$ is implicit,
i.e., is considered fixed for a given implementation; as a consequence, we do
not consider here mixed radix operations involving numbers with different
radices $\beta$ and $\beta'$.

We have already seen that there are several ways to encode the significand
m when $\beta$ is not a power of two: in base-$\beta^k$ or in binary. For normal numbers
in radix 2, i.e., $2^{n-1} \le m < 2^n$, the leading bit of the significand is necessarily
1, thus one might choose not to encode it in memory, to gain an extra bit of
precision. This is called the implicit leading bit, and it is the choice made in
the IEEE 754 formats. For example the double precision format has a sign
bit, an exponent field of 11 bits, and a significand of 53 bits, with only 52
bits stored, which gives a total of 64 stored bits:



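    +---------+-----------+--------------------------------------+
    |  sign   | exponent  |             significand              |
    | (1 bit) | (11 bits) | (52 bits, plus implicit leading bit) |
    +---------+-----------+--------------------------------------+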



A nice consequence of this particular encoding is the following. Let x be
a double precision number, neither subnormal, $\pm\infty$, NaN, nor the largest
normal number. Consider the 64-bit encoding of x as a 64-bit integer, with
the sign bit in the most significant bit, the exponent bits in the next
significant bits, and the explicit part of the significand in the low significant bits.
Adding 1 to this 64-bit integer yields the next double precision number to
x, away from zero. Indeed, if the significand m is smaller than $2^{53} - 1$, m
becomes m + 1 which is smaller than $2^{53}$. If $m = 2^{53} - 1$, then the lowest
52 bits are all set, and a carry occurs between the significand field and the
exponent field. Since the significand field becomes zero, the new significand
is $2^{52}$, taking into account the implicit leading bit. This corresponds to a
change from $(2^{53} - 1) \cdot 2^e$ to $2^{52} \cdot 2^{e+1}$, which is exactly the next number away
from zero. Thanks to this consequence of the encoding, an integer
comparison of two words (ignoring the actual type of the operands) should give the
same result as a floating-point comparison, so it is possible to sort normal
positive floating-point numbers as if they were integers of the same length
(64-bit for double precision).
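This property is easy to observe by reinterpreting the 64 bits of a double as an unsigned integer. The helper names in this sketch are our own, and it relies only on the standard `struct` module.

```python
import struct


def bits(x):
    """The 64-bit pattern of a double, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b):
    """The double whose 64-bit pattern is the unsigned integer b."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def next_away_from_zero(x):
    """Neighbour of x away from zero (x finite, not the largest normal number)."""
    return from_bits(bits(x) + 1)


after_one = next_away_from_zero(1.0)              # 1 + 2^-52
carry_case = next_away_from_zero(2.0 - 2**-52)    # carry into the exponent field: 2.0
positives = [1e300, 0.5, 3.0, 1e-300, 1.5, 1.0]
same_order = sorted(positives, key=bits) == sorted(positives)
```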

In arbitrary precision, saving one bit is not as crucial as in fixed (small)
precision, where one is constrained by the word size (usually 32 or 64 bits).
Thus, in arbitrary precision, it is easier and preferable to encode the whole
significand. Also, note that having an "implicit bit" is not possible in radix
$\beta > 2$, since for a normal number the most significant digit might take several
values, from 1 to $\beta - 1$.

When the significand occupies several words, it can be stored in a linked 
list, or in an array (with a separate size field). Lists are easier to extend, but 
accessing arrays is usually more efficient because fewer memory references 
are required in the inner loops. 

The sign s is most easily encoded as a separate bit field, with a
non-negative significand. Other possibilities are to have a signed significand,
using either 1's complement or 2's complement, but in the latter case a
special encoding is required for zero, if it is desired to distinguish +0 from
−0. Finally, the exponent might be encoded as a signed word (for example,
type long in the C language).

3.1.6 Precision: Local, Global, Operation, Operand 

The different operands of a given operation might have different precisions, 
and the result of that operation might be desired with yet another precision. 
There are several ways to address this issue. 

• The precision, say n, is attached to a given operation. In this case,
operands with a smaller precision are automatically converted to preci- 
sion n. Operands with a larger precision might either be left unchanged, 
or rounded to precision n. In the former case, the code implementing 
the operation must be able to handle operands with different precisions. 
In the latter case, the rounding mode to shorten the operands must be 
specified. Note that this rounding mode might differ from that of the 
operation itself, and that operand rounding might yield large errors. 
Consider for example a = 1.345 and b = 1.234567 with a precision of 4 
digits. If b is taken as exact, the exact value of a — b equals 0.110433, 
which when rounded to nearest becomes 0.1104. If b is first rounded to 
nearest to 4 digits, we get b' = 1.235, and a — b' = 0.1100 is rounded 
to itself. 

• The precision n is attached to each variable. Here again two cases may
occur. If the operation destination is part of the operation inputs, as
in sub(c, a, b), which means $c \leftarrow \circ(a - b)$, then the precision of
the result operand c is known, thus the rounding precision is known in
advance. Alternatively, if no precision is given for the result, one might
choose the maximal (or minimal) precision from the input operands, or
use a global variable, or request an extra precision parameter for the
operation, as in c = sub(a, b, n).
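The effect of rounding an operand before the operation, as in the example with a = 1.345 and b = 1.234567 above, can be reproduced with Python's `decimal` module, whose context operations treat their operands as exact and round only the result. The variable names are ours.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)   # precision attached to the operation
a = Decimal("1.345")
b = Decimal("1.234567")

exact_operands = ctx.subtract(a, b)         # b taken as exact, result rounded once
b_rounded = ctx.plus(b)                     # b first rounded to 4 digits
rounded_operand = ctx.subtract(a, b_rounded)
```

The two results differ in the last two digits, which is the large error caused by pre-rounding an operand.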

Of course, all these different semantics are non-equivalent, and may yield 
different results. In the following, we consider the case where each variable, 
including the destination variable, has its own precision, and no pre-rounding 
or post-rounding occurs. In other words, the operands are considered exact 
to their full precision. 

Rounding is considered in detail in §3.1.9. Here we define what we mean
by the correct rounding of a function.

Definition 3.1.1 Let a, b, ... be floating-point numbers, f be a mathematical
function, $n \ge 1$ be an integer, and $\circ$ a rounding mode. We say that c is the
correct rounding of f(a, b, ...), and we write $c = \circ(f(a, b, \ldots))$, if c is the
floating-point number of precision n closest to f(a, b, ...) according to the given
rounding mode. (In case several numbers are at the same distance from f(a, b, ...), the
rounding mode must define in a deterministic way which one is "the closest".)

3.1.7 Link to Integers 

Most floating-point operations reduce to arithmetic on the significands, which 
can be considered as integers as seen in the beginning of this section. There- 
fore efficient arbitrary precision floating-point arithmetic requires efficient 
underlying integer arithmetic (see Chapter 1).

Conversely, floating-point numbers might be useful for the implementation
of arbitrary precision integer arithmetic. For example, one might use
hardware floating-point numbers to represent an arbitrary precision integer.
Indeed, since a double precision floating-point number has 53 bits of precision,
it can represent an integer up to $2^{53} - 1$, and an integer $A$ can be represented as
$A = a_{n-1}\beta^{n-1} + \cdots + a_i\beta^i + \cdots + a_1\beta + a_0$, where $\beta = 2^{53}$, and the $a_i$ are
stored in double precision numbers. Such an encoding was popular when
most processors were 32-bit, and some had relatively slow integer operations
in hardware. Now that most computers are 64-bit, this encoding is obsolete.

Floating-point expansions are a variant of the above. Instead of storing
$a_i$ and having $\beta^i$ implicit, the idea is to directly store $a_i\beta^i$. Of course, this
only works for relatively small $i$, i.e., whenever $a_i\beta^i$ does not exceed the
format range. For example, for IEEE 754 double precision and $\beta = 2^{53}$, the
maximal precision is 1024 bits. (Alternatively, one might represent an integer
as a multiple of the smallest positive number $2^{-1074}$, with a corresponding
maximal precision of 2098 bits.)

Hardware floating-point numbers might also be used to implement the
Fast Fourier Transform (FFT), using complex numbers with floating-point
real and imaginary part (see §3.3.1).



3.1.8 Ziv's Algorithm and Error Analysis 

A rounding boundary is a point at which the rounding function $\circ(x)$ is
discontinuous.

In fixed precision, for basic arithmetic operations, it is sometimes possible
to design one-pass algorithms that directly compute a correct rounding.
However, in arbitrary precision, or for elementary or special functions, the
classical method is to use Ziv's algorithm:

1. we are given an input x, a target precision n, and a rounding mode; 

2. compute an approximation $y$ with precision $m > n$, and a corresponding
error bound $\varepsilon$ such that $|y - f(x)| \le \varepsilon$;

3. if $[y - \varepsilon, y + \varepsilon]$ contains a rounding boundary, increase $m$ and go to
Step 2;

4. output the rounding of $y$, according to the given mode.

The error bound $\varepsilon$ at Step 2 might be computed either a priori, i.e., from
$x$ and $n$ only, or dynamically, i.e., from the different intermediate values
computed by the algorithm. A dynamic bound will usually be tighter, but
will require extra computations (however, those computations might be done
in low precision).
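
The loop of Ziv's algorithm can be sketched in Python as follows. This is only an illustration under several assumptions that are not part of the text: radix 10 instead of radix 2, the function exp, Python's decimal module as the source of the approximation y (its exp is documented as correctly rounded, so one ulp at the working precision is a safe a priori error bound), and the hypothetical names ziv_exp, guard and max_tries. There is no exact-result test, so the number of retries is capped.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

def ziv_exp(x, n, rounding=ROUND_HALF_EVEN, guard=4, max_tries=20):
    """Round exp(x) to n significant decimal digits with a Ziv-style loop."""
    target = Context(prec=n, rounding=rounding)
    m = n + guard                              # working precision m > n
    for _ in range(max_tries):
        work = Context(prec=m, rounding=ROUND_HALF_EVEN)
        y = work.exp(x)                        # Step 2: approximation y
        # A priori error bound: one ulp of y at precision m.
        eps = Decimal((0, (1,), y.adjusted() - m + 1))
        exact = Context(prec=m + 2)            # wide enough for y +/- eps
        lo = target.plus(exact.subtract(y, eps))
        hi = target.plus(exact.add(y, eps))
        if lo == hi:                           # Step 3: no boundary inside
            return lo                          # Step 4: output the rounding
        m *= 2                                 # otherwise increase m, retry
    raise ArithmeticError("rounding could not be decided")

print(ziv_exp(Decimal(1), 10))
```

Comparing the roundings of the two interval endpoints is one simple way to test for a rounding boundary; it relies on the monotonicity of rounding.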

Depending on the mathematical function to be implemented, one might 
prefer an absolute or a relative error analysis. When computing a relative 
error bound, at least two techniques are available: one might express the 
errors in terms of units in the last place (ulps), or one might express them in 
terms of true relative error. It is of course possible in a given analysis to mix
both kinds of errors, but in general one loses a constant factor (the radix
$\beta$) when converting from one kind of relative error to the other kind.

Another important distinction is forward vs backward error analysis.
Assume we want to compute $y = f(x)$. Because the input is rounded, and/or
because of rounding errors during the computation, we might actually
compute $y' \approx f(x')$. Forward error analysis will bound $|y' - y|$ if we have a bound
on $|x' - x|$ and on the rounding errors that occur during the computation.

Backward error analysis works in the other direction. If the computed
value is $y'$, then backward error analysis will give us a number $\delta$ such that,
for some $x'$ in the ball $|x' - x| \le \delta$, we have $y' = f(x')$. This means that the
error is no worse than might have been caused by an error of $\delta$ in the input
value. Note that, if the problem is ill-conditioned, $\delta$ might be small even if
$|y' - y|$ is large.

In our error analyses, we assume that no overflow or underflow occurs,
or equivalently that the exponent range is unbounded, unless the contrary is
explicitly stated.

3.1.9 Rounding 

There are several possible definitions of rounding. For example probabilistic
rounding (also called stochastic rounding) chooses at random a rounding
towards $+\infty$ or $-\infty$ for each operation. The IEEE 754 standard defines four
rounding modes: towards zero, $+\infty$, $-\infty$ and to nearest (with ties broken
to even). Another useful mode is "rounding away", which rounds in the
opposite direction from zero: a positive number is rounded towards $+\infty$,
and a negative number towards $-\infty$. If the sign of the result is known, all
IEEE 754 rounding modes might be converted to either rounding to nearest,
rounding towards zero, or rounding away.

Theorem 3.1.1 Consider a floating-point system with radix $\beta$ and
precision $n$. Let $u$ be the rounding to nearest of some real $x$, then the following
inequalities hold:

$|u - x| \le \frac{1}{2}\,\mathrm{ulp}(u)$

$|u - x| \le \frac{1}{2}\,\beta^{1-n}\,|u|$

$|u - x| \le \frac{1}{2}\,\beta^{1-n}\,|x|$.

Proof. For $x = 0$, necessarily $u = 0$, and the statement holds. Without loss
of generality, we can assume $u$ and $x$ positive. The first inequality is the
definition of rounding to nearest, and the second one follows from
$\mathrm{ulp}(u) \le \beta^{1-n} u$. (In the case $\beta = 2$, it gives $|u - x| \le 2^{-n}|u|$.) For the last inequality,
we distinguish two cases: if $u \le x$, it follows from the second inequality. If
$x < u$, then if $x$ and $u$ have the same exponent, i.e., $\beta^{e-1} \le x < u < \beta^e$, then
$\frac{1}{2}\mathrm{ulp}(u) = \frac{1}{2}\beta^{e-n} \le \frac{1}{2}\beta^{1-n} x$. The only remaining case is $\beta^{e-1} \le x < u = \beta^e$.
Since the floating-point number preceding $\beta^e$ is $\beta^e(1 - \beta^{-n})$, and $x$ was
rounded to nearest, we have $|u - x| \le \frac{1}{2}\beta^{e-n}$ here too. □

In order to round according to a given rounding mode, one proceeds as 
follows: 




1. first round as if the exponent range was unbounded, with the given 
rounding mode; 

2. if the rounded result is within the exponent range, return this value; 

3. otherwise raise the "underflow" or "overflow" exception, and return $\pm 0$
or $\pm\infty$ accordingly.

For example, assume radix 10 with precision 4, $e_{\max} = 3$, with $x = 0.9234 \cdot 10^3$,
$y = 0.7656 \cdot 10^2$. The exact sum $x + y$ equals $0.99996 \cdot 10^3$. With rounding
towards zero, we obtain $0.9999 \cdot 10^3$, which is representable, so there is no
overflow. With rounding to nearest, $x + y$ rounds to $0.1000 \cdot 10^4$ with an
unbounded exponent range, which exceeds $e_{\max} = 3$, thus we get $+\infty$ as
result, with an overflow. In this model, the overflow depends not only on
the operands, but also on the rounding mode. This is consistent with IEEE
754, which requires that a number larger or equal to $(1 - 2^{-n})2^{e_{\max}}$ rounds
to $+\infty$.
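
The radix-10 example above can be reproduced with Python's decimal module, as a sketch. Two assumptions are made here: decimal writes significands as d.ddd rather than 0.dddd, so the bound $e_{\max} = 3$ of the text corresponds to Emax=2 in the context, and traps are disabled so that an overflow returns infinity and sets a flag instead of raising an exception. The helper name add_with_mode is hypothetical.

```python
from decimal import Context, Decimal, Overflow, ROUND_DOWN, ROUND_HALF_EVEN

def add_with_mode(rounding):
    """Add 0.9234e3 and 0.7656e2 with 4 digits and a bounded exponent."""
    # Emax=2 for d.ddd * 10**e matches e_max = 3 for 0.dddd * 10**e.
    ctx = Context(prec=4, Emax=2, Emin=-2, rounding=rounding, traps=[])
    result = ctx.add(Decimal("923.4"), Decimal("76.56"))
    return result, bool(ctx.flags[Overflow])

print(add_with_mode(ROUND_DOWN))        # rounding towards zero
print(add_with_mode(ROUND_HALF_EVEN))   # rounding to nearest
```

The same operands overflow under one rounding mode and not under the other, which is the point made in the paragraph above.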

The "round to nearest" mode from IEEE 754 rounds the result of an
operation to the nearest representable number. In case the result of an
operation is exactly in the middle of two consecutive numbers, the one with
its least significant bit zero is chosen (remember IEEE 754 only considers
radix 2). For example $1.1011_2$ is rounded with a precision of 4 bits to $1.110_2$,
as is $1.1101_2$. However this rule does not readily extend to an arbitrary radix.
Consider for example radix $\beta = 3$, a precision of 4 digits, and the number
$(1212.111\ldots)_3$. Both $(1212)_3$ and $(1220)_3$ end in an even digit. The natural
extension is to require the whole significand to be even, when interpreted
as an integer in $[\beta^{n-1}, \beta^n - 1]$. In this setting, $(1212.111\ldots)_3$ rounds to
$(1212)_3 = 50_{10}$. (Note that $\beta^n$ is an odd number here.)

Assume we want to correctly round to $n$ bits a real number whose binary
expansion is $2^e \cdot 0.1b_2 \ldots b_n b_{n+1} \ldots$ It is enough to know the values of $r = b_{n+1}$
(called the round bit) and that of the sticky bit $s$, which is 0 when
$b_{n+2}b_{n+3}\ldots$ is identically zero, and 1 otherwise. Table 3.1 shows how to
correctly round given $r$, $s$, and the given rounding mode; rounding to $\pm\infty$
being converted to rounding to zero or away, according to the sign of the
number. The entry "$b_n$" is for round to nearest in the case of a tie: if $b_n = 0$
it will be unchanged, but if $b_n = 1$ we add 1 (thus changing $b_n$ to 0).
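
As a sketch, the round bit and sticky bit rules can be written for nonnegative integers in Python. The function name round_bits and the convention of passing the value as an integer together with the number of low bits to drop are assumptions made for this illustration.

```python
def round_bits(value, shift, mode):
    """Round the nonnegative integer value / 2**shift to an integer.

    mode is "zero", "away" or "nearest" (ties broken to even).
    """
    t = value >> shift                         # truncated significand
    if shift == 0:
        return t
    r = (value >> (shift - 1)) & 1             # round bit
    s = int(value & ((1 << (shift - 1)) - 1) != 0)   # sticky bit
    if mode == "zero":
        correction = 0
    elif mode == "away":
        correction = r | s                     # 0 only if r = s = 0
    elif mode == "nearest":
        correction = r & (s | (t & 1))         # tie (r=1, s=0): add b_n
    else:
        raise ValueError("unknown rounding mode")
    return t + correction

# The two 4-bit examples of the text: 1.1011 and 1.1101 both give 1.110.
print(bin(round_bits(0b11011, 1, "nearest")), bin(round_bits(0b11101, 1, "nearest")))
```

Note that the result may need one more bit than the truncated significand when the correction carries all the way up; the caller then has to renormalise.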

Table 3.1: Rounding rules according to the round bit $r$ and the sticky bit $s$:
a "0" entry means truncate (round to zero), a "1" means round away from
zero (add 1 to the truncated significand).

In general, we do not have an infinite expansion, but a finite approximation
$y$ of an unknown real value $x$. For example, $y$ might be the result of an
arithmetic operation such as division, or an approximation to the value of a
transcendental function such as exp. The following problem arises: given the
approximation $y$, and a bound on the error $|y - x|$, is it possible to determine
the correct rounding of $x$? Algorithm RoundingPossible returns true if
and only if it is possible.



Algorithm 39 RoundingPossible

Input: a floating-point number $y = 0.1y_2 \ldots y_m$, a precision $n < m$, an error
bound $\varepsilon = 2^{-k}$, a rounding mode $\circ$
Output: true when $\circ_n(x)$ can be determined for $|y - x| \le \varepsilon$

if $k \le n + 1$ then return false
if $\circ$ is to nearest then $\ell \leftarrow n + 1$ else $\ell \leftarrow n$
if $\circ$ is to nearest and $y_\ell = y_{\ell+1}$ then return true
if $y_{\ell+1} = y_{\ell+2} = \cdots = y_k$ then return false
return true.



Proof. Since rounding is monotonic, it is possible to determine $\circ(x)$ exactly
when $\circ(y - 2^{-k}) = \circ(y + 2^{-k})$, or in other words when the interval
$[y - 2^{-k}, y + 2^{-k}]$ contains no rounding boundary. The rounding boundaries for rounding
to nearest in precision $n$ are those for directed rounding in precision $n + 1$.

If $k \le n + 1$, then the interval $[y - 2^{-k}, y + 2^{-k}]$ has width at least $2^{-n}$, thus
contains one rounding boundary: it is not possible to round correctly. In case
of rounding to nearest, if the round bit and the following bit are equal
(thus 00 or 11) and the error is after the round bit ($k > n + 1$), it is possible
to round correctly. Otherwise it is only possible when $y_{\ell+1}, y_{\ell+2}, \ldots, y_k$ are
not all identical. □
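
A direct transcription of Algorithm RoundingPossible into Python might look as follows. The representation of $y$ as a string of its bits after the binary point (so that bits[0] is $y_1 = 1$), the mode names, and the convention that bits beyond position $m$ are zero are assumptions of this sketch.

```python
def rounding_possible(bits, n, k, mode):
    """Decide whether y = 0.bits can be rounded to n bits, given |y - x| <= 2**-k.

    bits: string of '0'/'1' holding y_1 y_2 ... y_m (y_1 must be '1').
    mode: "nearest", "zero" or "away".
    """
    def y(i):
        # Bit y_i with 1-based index; positions past m are zero.
        return bits[i - 1] if i <= len(bits) else "0"

    if k <= n + 1:
        return False
    l = n + 1 if mode == "nearest" else n
    if mode == "nearest" and y(l) == y(l + 1):
        return True
    if len({y(i) for i in range(l + 1, k + 1)}) == 1:
        return False                           # y_{l+1} = ... = y_k
    return True

print(rounding_possible("10110010", 4, 8, "zero"))
```

For instance, with $n = 4$ and $k = 8$, the bits 0010 after position 4 are not all equal, so a directed rounding can be decided, whereas 0000 or 1111 would leave a rounding boundary within reach of the error.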
The Double Rounding Problem

When a given real value $x$ is first rounded to precision $m$, then to precision
$n < m$, we say that a "double rounding" occurs. The "double rounding
problem" happens when this value differs from the direct rounding of $x$ to
the smaller precision $n$, assuming the same rounding mode is used in all
cases:

$\circ_n(\circ_m(x)) \ne \circ_n(x)$.

The double rounding problem does not occur for the directed rounding
modes. For those rounding modes, the rounding boundaries at the larger
precision $m$ refine those at the smaller precision $n$, thus all real values $x$ that
round to the same value $y$ at precision $m$ also round to the same value at
precision $n$, namely $\circ_n(y)$.

Consider the decimal value x = 3.14251. Rounding to nearest to 5 digits, 
one gets y = 3.1425; rounding y to nearest-even to 4 digits, one gets 3.142, 
whereas the direct rounding of x would give 3.143. 
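
This decimal example can be checked with Python's decimal module, as a small sketch (the helper name round_to is an assumption). It also shows that a directed mode, here rounding towards zero, gives the same result either way.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def round_to(x, digits, rounding):
    """Round x to the given number of significant decimal digits."""
    return Context(prec=digits, rounding=rounding).plus(x)

x = Decimal("3.14251")
direct = round_to(x, 4, ROUND_HALF_EVEN)
twice = round_to(round_to(x, 5, ROUND_HALF_EVEN), 4, ROUND_HALF_EVEN)
direct_down = round_to(x, 4, ROUND_DOWN)
twice_down = round_to(round_to(x, 5, ROUND_DOWN), 4, ROUND_DOWN)
print(direct, twice, direct_down, twice_down)
```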

With rounding to nearest mode, the double rounding problem only occurs
when the second rounding involves the even-rule, i.e., the value $y = \circ_m(x)$
is a rounding boundary at precision $n$. Indeed, otherwise $y$ is at distance at
least one ulp (in precision $m$) of a rounding boundary at precision $n$, and
since $|y - x|$ is bounded by half an ulp (in precision $m$), all possible values
for $x$ round to the same value in precision $n$.

Note that the double rounding problem does not occur with all ways of
breaking ties for rounding to nearest (Exercise 3.2).

3.1.10 Strategies 

To determine the correct rounding of $f(x)$ with $n$ bits of precision, the best
strategy is usually to first compute an approximation $y$ of $f(x)$ with a working
precision of $m = n + h$ bits, with $h$ relatively small. Several strategies are
possible in Ziv's algorithm (§3.1.8) when this first approximation $y$ is not
accurate enough, or too close to a rounding boundary.

• Compute the exact value of $f(x)$, and round it to the target precision $n$.
This is possible for a basic operation, for example $f(x) = x^2$, or more
generally $f(x, y) = x + y$ or $x \times y$. Some elementary functions may yield
an exact representable output too, for example $\sqrt{2.25} = 1.5$. An "exact
result" test after the first approximation avoids possibly unnecessary
further computations.

• Repeat the computation with a larger working precision $m' = n + h'$.
Assuming that the digits of $f(x)$ behave "randomly" and that
$|f'(x)/f(x)|$ is not too large, using $h' \approx \lg n$ is enough to guarantee
that rounding is possible with probability $1 - O(1/n)$. If rounding is still
not possible, because the $h'$ last digits of the approximation encode
$0$ or $2^{h'} - 1$, one can increase the working precision and try again.
A check for exact results guarantees that this process will eventually
terminate, provided the algorithm used has the property that it gives
the exact result if this result is representable and the working precision
is high enough. For example, the square root algorithm should return
the exact result if it is representable (see Algorithm FPSqrt in §3.5,
and also Exercise 3.3).



3.2 Addition, Subtraction, Comparison 

Addition and subtraction of floating-point numbers operate from the most 
significant digits, whereas the integer addition and subtraction start from the 
least significant digits. Thus completely different algorithms are involved. In 
addition, in the floating-point case, part or all of the inputs might have no 
impact on the output, except in the rounding phase. 

In summary, floating-point addition and subtraction are more difficult to 
implement than integer addition/subtraction for two reasons: 

• scaling due to the exponents requires shifting the significands before
adding or subtracting them. In principle one could perform all operations
using only integer operations, but this would require huge integers,
for example when adding 1 and $2^{-1000}$.

• as the carries are propagated from least to most significant digits, one 
may have to look at arbitrarily low input digits to guarantee correct 
rounding. 

In this section, we distinguish between "addition", where both operands
to be added have the same sign, and "subtraction", where the operands to
be added have different signs (we assume a sign-magnitude representation).
The case of one or both operands zero is treated separately; in the description
below we assume that all operands are nonzero.

3.2.1 Floating-Point Addition

Algorithm FPadd adds two binary floating-point numbers $b$ and $c$ of the
same sign. More precisely, it computes the correct rounding of $b + c$, with
respect to the given rounding mode $\circ$. For the sake of simplicity, we assume
$b$ and $c$ are positive, $b \ge c > 0$. It will also be convenient to scale $b$ and $c$ so
that $2^{n-1} \le b < 2^n$ and $2^{m-1} \le c < 2^m$, where $n$ is the desired precision of
the output, and $m \le n$. Of course, if the inputs $b$ and $c$ to Algorithm FPadd
are scaled by $2^k$, then the output must be scaled by $2^{-k}$. We assume that
the rounding mode is to nearest, towards zero, or away from zero (rounding
to $\pm\infty$ reduces to rounding towards zero or away from zero, depending on
the sign of the operands).

Algorithm 40 FPadd

Input: $b \ge c > 0$ two binary floating-point numbers, a precision $n$ such that
$2^{n-1} \le b < 2^n$, and a rounding mode $\circ$.
Output: a floating-point number $a$ of precision $n$ and scale $e$ such that
$a \cdot 2^e = \circ(b + c)$
1: Split $b$ into $b_h + b_\ell$ where $b_h$ contains the $n$ most significant bits of $b$.
2: Split $c$ into $c_h + c_\ell$ where $c_h$ contains the most significant bits of $c$, and
   $\mathrm{ulp}(c_h) = \mathrm{ulp}(b_h)$.
3: $a_h \leftarrow b_h + c_h$, $e \leftarrow 0$
4: $(c, r, s) \leftarrow b_\ell + c_\ell$
5: $(a, t) \leftarrow a_h + c + \mathrm{round}(\circ, r, s)$
6: $e \leftarrow 0$
7: if $a \ge 2^n$ then
8:    $a \leftarrow \mathrm{round2}(\circ, a \bmod 2, t)$, $e \leftarrow e + 1$
9:    if $a = 2^n$ then $(a, e) \leftarrow (a/2, e + 1)$
10: return $(a, e)$.

The values of $\mathrm{round}(\circ, r, s)$ and $\mathrm{round2}(\circ, a \bmod 2, t)$ are given in
Table 3.2. At step 4, the notation $(c, r, s) \leftarrow b_\ell + c_\ell$ means that $c$ is the carry
bit of $b_\ell + c_\ell$, $r$ the round bit, and $s$ the sticky bit: $c$, $r$, and $s$ are in $\{0, 1\}$.
For rounding to nearest, $t$ is a ternary value, which is respectively positive,
zero, or negative when $a$ is smaller than, equal to, or larger than the exact
sum $b + c$.

Theorem 3.2.1 Algorithm FPadd is correct.

  mode      r     s      round(o, r, s)                t := sign(b + c - a)
  zero      any   any    0
  away      any   any    0 if r = s = 0, 1 otherwise
  nearest   0     any    0                             +s
  nearest   1     0      0/1 (even rounding)           +1/-1
  nearest   1     != 0   1                             -1

  mode      a mod 2   t     round2(o, a mod 2, t)
  any       0         any   a/2
  zero      1               (a - 1)/2
  away      1               (a + 1)/2
  nearest   1         0     (a - 1)/2 if even, (a + 1)/2 otherwise
  nearest   1         ±1    (a + t)/2

Table 3.2: Rounding rules for addition.

Proof. Without loss of generality, we can assume that $2^{n-1} \le b < 2^n$ and
$2^{m-1} \le c < 2^m$, with $m \le n$. With this assumption, $b_h$ and $c_h$ are the
integer parts of $b$ and $c$, $b_\ell$ and $c_\ell$ their fractional parts. Since $b \ge c$, we have
$c_h \le b_h$ and $2^{n-1} \le b_h \le 2^n - 1$, thus $2^{n-1} \le a_h \le 2^{n+1} - 2$, and at step 5,
$2^{n-1} \le a \le 2^{n+1}$. If $a < 2^n$, $a$ is the correct rounding of $b + c$. Otherwise, we
face the "double rounding" problem: rounding $a$ down to $n$ bits will give the
correct result, except when $a$ is odd and rounding is to nearest. In that case,
we need to know if the first rounding was exact, and if not in which direction
it was rounded; this information is represented by the ternary value $t$. After
the second rounding, we have $2^{n-1} \le a \le 2^n$. □

Note that the exponent $e_a$ of the result lies between $e_b$ (the exponent of $b$,
here we considered the case $e_b = n$) and $e_b + 2$. Thus no underflow can occur
in an addition. The case $e_a = e_b + 2$ can occur only when the destination
precision is less than that of the operands.



3.2.2 Floating-Point Subtraction 

Floating-point subtraction is very similar to addition, with the difference
that cancellation can occur. Consider for example the subtraction
$6.77823 - 5.98771$. The most significant digit of both operands disappeared in the
result $0.79052$. This cancellation can be dramatic, as in
$6.7782357934 - 6.7782298731 = 0.0000059203$, where six digits were cancelled.

Two approaches are possible, assuming $n$ result digits are wanted, and
the exponent difference between the inputs is $d$.

• Subtract from the $n$ most significant digits of the larger operand (in
absolute value) the $n - d$ most significant digits of the smaller operand. If
the result has $n - e$ digits with $e > 0$, restart with $n + e$ digits from
the larger operand and $(n + e) - d$ from the smaller one.

• Alternatively, predict the number $e$ of cancelled digits in the subtraction,
and directly subtract the $(n + e) - d$ most significant digits of the
smaller operand from the $n + e$ most significant digits of the larger one.

Note that in the first case, we might have $e = n$, i.e., all most significant
digits cancel, thus the process might need to be repeated several times.

The first step in the second approach is usually called leading zero detection.
Note that the number $e$ of cancelled digits might depend on the rounding
mode. For example $6.778 - 5.7781$ with a 3-digit result yields $0.999$ with
rounding toward zero, and $1.00$ with rounding to nearest. Therefore in a real
implementation, the exact definition of $e$ has to be made more precise.

Finally, in practice one will consider $n + g$ and $(n + g) - d$ digits instead of
$n$ and $n - d$, where the $g$ "guard digits" will prove useful (i) either to decide
the final rounding, (ii) and/or to avoid another loop in case $e \le g$.

Sterbenz's Theorem 

Sterbenz's Theorem is an important result concerning floating-point subtrac- 
tion (of operands of same sign). It states that the rounding error is zero in 
some important cases. More precisely: 

Theorem 3.2.2 (Sterbenz) If x and y are two floating-point numbers of 
same precision n, such that y lies in the interval [x/2, 2x], then y — x is exactly 
representable in precision n. 

Proof. The case $x = y = 0$ is trivial, so assume that $x \ne 0$. Since $y$ lies
in $[x/2, 2x]$, $x$ and $y$ must have the same sign. We assume without loss of
generality that $x$ and $y$ are positive.

Assume $x \le y \le 2x$ (the same reasoning applies for $x/2 \le y \le x$,
i.e., $y \le x \le 2y$, which is symmetric in $x, y$). Then since $x \le y$, we have
$\mathrm{ulp}(x) \le \mathrm{ulp}(y)$, thus $y$ is an integer multiple of $\mathrm{ulp}(x)$. It follows that $y - x$ is
an integer multiple of $\mathrm{ulp}(x)$ too, and since $0 \le y - x \le x$, $y - x$ is necessarily
representable with the precision of $x$. □

It is important to note that Sterbenz's Theorem applies for any radix $\beta$; the
constant 2 in $[x/2, 2x]$ has nothing to do with the radix.
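
Sterbenz's Theorem can be observed experimentally on hardware doubles, as in the following Python sketch. It assumes that Python floats are IEEE 754 binary64 numbers, and it uses exact rational arithmetic (fractions.Fraction) as the reference; the helper names are hypothetical.

```python
from fractions import Fraction
import random

def subtraction_is_exact(x, y):
    """True if the floating-point difference y - x has no rounding error."""
    return Fraction(y) - Fraction(x) == Fraction(y - x)

def random_sterbenz_pair(rng):
    """Return doubles (x, y) with x/2 <= y <= 2x."""
    while True:
        x = rng.uniform(1.0, 1e6)
        y = x * rng.uniform(0.5, 2.0)
        if x / 2 <= y <= 2 * x:
            return x, y

rng = random.Random(0)
pairs = [random_sterbenz_pair(rng) for _ in range(10000)]
print(all(subtraction_is_exact(x, y) for x, y in pairs))
# Outside the interval [x/2, 2x] the difference is in general inexact:
print(subtraction_is_exact(1.0, 1e-20))
```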



3.3 Multiplication 

Multiplication of floating-point numbers is called a short product. This
reflects the fact that in some cases, the low part of the full product of the
significands has no impact, except maybe for the rounding, on the final
result. Consider the multiplication $xy$, where $x = \ell \cdot \beta^e$, and $y = m \cdot \beta^f$.
Then $\circ(x \cdot y) = \circ(\ell \cdot m)\beta^{e+f}$, thus it suffices to consider the case where $x = \ell$
and $y = m$ are integers, and the product is rounded at some weight $\beta^g$ for
$g \ge 0$. Either the integer product $\ell \cdot m$ is computed exactly, using one of the
algorithms from Chapter 1 and then rounded; or the upper part is computed
directly using a "short product algorithm", with correct rounding. The
different cases that can occur are depicted in Fig. 3.1.

An interesting question is: how many consecutive identical bits can occur
after the round bit? Without loss of generality, we can rephrase this question
as follows. Given two odd integers of at most $n$ bits, what is the longest run of
identical bits in their product? (In the case of an even significand, one might
write it $m = \ell \cdot 2^e$ with $\ell$ odd.) There is no a priori bound except the trivial one
of $2n - 2$ for the number of zeros, and $2n - 1$ for the number of ones. Consider
with a precision of 5 bits for example, $27 \times 19 = (1000000001)_2$. More generally
such a case corresponds to a factorisation of $2^{2n-1} + 1$ into two integers of $n$
bits, for example $258513 \times 132913 = 2^{35} + 1$. For consecutive ones, the value
$2n$ is not possible since $2^{2n} - 1$ cannot factor into two integers of at most $n$
bits. Therefore the maximal runs have $2n - 1$ ones, for example $217 \times 151 =
(111111111111111)_2$ for $n = 8$. A larger example is $849583 \times 647089 = 2^{39} - 1$.
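
The numerical examples above are easy to verify with Python integers. The helper longest_run below is a hypothetical name introduced only for this check.

```python
def longest_run(value, bit):
    """Length of the longest run of bit ('0' or '1') in the binary expansion."""
    other = "1" if bit == "0" else "0"
    return max(len(run) for run in bin(value)[2:].split(other))

examples = {
    (27, 19): 2**9 + 1,            # n = 5, run of 2n - 2 = 8 zeros
    (258513, 132913): 2**35 + 1,   # n = 18, run of 34 zeros
    (217, 151): 2**15 - 1,         # n = 8, run of 2n - 1 = 15 ones
    (849583, 647089): 2**39 - 1,   # n = 20, run of 39 ones
}
for (u, v), expected in examples.items():
    print(u * v == expected, longest_run(u * v, "0"), longest_run(u * v, "1"))
```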

The exact product of two floating-point numbers $m \cdot \beta^e$ and $m' \cdot \beta^{e'}$ is
$(mm') \cdot \beta^{e+e'}$. Therefore, if no underflow or overflow occurs, the problem
reduces to the multiplication of the significands $m$ and $m'$. See Algorithm
FPmultiply.

Algorithm 41 FPmultiply

Input: $x = m \cdot \beta^e$, $x' = m' \cdot \beta^{e'}$, a precision $n$, a rounding mode $\circ$
Output: $\circ(xx')$ rounded to precision $n$
1: $m'' \leftarrow \circ(mm')$ rounded to precision $n$
2: return $m'' \cdot \beta^{e+e'}$.

The product at step 1 of FPmultiply is a short product, i.e., a product
whose most significant part only is wanted, as discussed at the start of this
section. In the quadratic range, it can be computed in about half the time of
a full product. In the Karatsuba and Toom-Cook ranges, Mulders' algorithm
can gain 10% to 20%; however due to carries, using this algorithm for
floating-point computations is tricky. Lastly, in the FFT range, no better algorithm
is known than computing the full product $mm'$, and then rounding it.

Figure 3.1: Different multiplication scenarios, according to the input and
output precisions. The rectangle corresponds to the full product of the inputs
$x$ and $y$ (most significant digits bottom left), the triangle to the wanted short
product. Case (a): no rounding is necessary, the product being exact; case
(b): the full product needs to be rounded, but the inputs should not be; case
(c): the input with the larger precision might be truncated before performing
a short product; case (d): both inputs might be truncated.

Hence our advice is to perform a full product of $m$ and $m'$, possibly after
truncating them to $n + g$ digits if they have more than $n + g$ digits. Here $g$
(the number of guard digits) should be positive (see Exercise 3.4).

It seems wasteful to multiply $n$-bit operands, producing a $2n$-bit product,
only to discard the low-order $n$ bits. Algorithm ShortProduct computes
an approximation to the short product without computing the $2n$-bit full
product.

Error analysis of the short product. Consider two $n$-word normalised
significands $A$ and $B$ that we multiply using a short product algorithm, where
the notation FullProduct(A, B, n) means the full integer product $A \cdot B$.

Algorithm 42 ShortProduct

Input: integers $A$, $B$, and $n$, with $0 \le A, B < \beta^n$
Output: an approximation of $AB \,\mathrm{div}\, \beta^n$

if $n \le n_0$ then return FullProduct(A, B) div $\beta^n$
choose $k \ge n/2$, $\ell \leftarrow n - k$
$C_1 \leftarrow$ FullProduct($A \,\mathrm{div}\, \beta^\ell$, $B \,\mathrm{div}\, \beta^\ell$) div $\beta^{k-\ell}$
$C_2 \leftarrow$ ShortProduct($A \bmod \beta^\ell$, $B \,\mathrm{div}\, \beta^k$, $\ell$)
$C_3 \leftarrow$ ShortProduct($A \,\mathrm{div}\, \beta^k$, $B \bmod \beta^\ell$, $\ell$)
return $C_1 + C_2 + C_3$.



Theorem 3.3.1 The value $C'$ returned by Algorithm ShortProduct differs
from the exact short product $C = AB \,\mathrm{div}\, \beta^n$ by at most $3(n - 1)$:

$C' \le C \le C' + 3(n - 1)$.

Proof. First since $A$, $B$ are nonnegative, and all roundings are truncations,
the inequality $C' \le C$ easily follows.

Let $A = \sum_i a_i\beta^i$ and $B = \sum_j b_j\beta^j$, where $0 \le a_i, b_j < \beta$. The possible
errors come from: (i) the neglected $a_i b_j$ terms, i.e., parts $C_2'$, $C_3'$, $C_4$ of Fig. 3.2,
(ii) the truncation while computing $C_1$, (iii) the error in the recursive calls
for $C_2$ and $C_3$.

Figure 3.2: Graphical view of Algorithm ShortProduct: the computed
parts are $C_1$, $C_2$, $C_3$, and the neglected parts are $C_2'$, $C_3'$, $C_4$ (most significant
part bottom left).

We first prove that the algorithm accumulates all products $a_i b_j$ with
$i + j \ge n - 1$. This corresponds to all terms on and below the diagonal
in Fig. 3.2. The most significant neglected terms are the bottom-left terms
from $C_2'$ and $C_3'$, respectively $a_{\ell-1}b_{k-1}$ and $a_{k-1}b_{\ell-1}$. Their contribution is at
most $2(\beta - 1)^2\beta^{n-2}$. The neglected terms from the next diagonal contribute
to at most $4(\beta - 1)^2\beta^{n-3}$, and so on. The total contribution of neglected
terms is thus bounded by:

$(\beta - 1)^2\beta^n[2\beta^{-2} + 4\beta^{-3} + 6\beta^{-4} + \cdots] < 2\beta^n$

(the inequality is strict since the sum is finite).

The truncation error in $C_1$ is at most $\beta^n$, thus the maximal difference
$e(n)$ between $C$ and $C'$ satisfies:

$e(n) \le 3 + 2e(\lfloor n/2 \rfloor)$,

which gives $e(n) \le 3(n - 1)$, since $e(1) = 0$. □
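
The following Python sketch transcribes Algorithm ShortProduct on Python integers and checks the bound of Theorem 3.3.1 on random inputs. The parameter names, the choice $k = \lceil n/2 \rceil$, and the default threshold n0 = 1 are assumptions of this illustration; Python's built-in multiplication plays the role of FullProduct.

```python
import random

def short_product(A, B, n, beta, n0=1):
    """Approximate (A * B) div beta**n for 0 <= A, B < beta**n."""
    if n <= n0:
        return (A * B) // beta**n              # full product, then truncate
    k = (n + 1) // 2                           # one valid choice of k >= n/2
    l = n - k
    C1 = ((A // beta**l) * (B // beta**l)) // beta**(k - l)
    C2 = short_product(A % beta**l, B // beta**k, l, beta, n0)
    C3 = short_product(A // beta**k, B % beta**l, l, beta, n0)
    return C1 + C2 + C3

def max_error(beta, n, trials, rng):
    """Largest observed C - C' over random inputs (returns -1 if C' > C)."""
    worst = 0
    for _ in range(trials):
        A, B = rng.randrange(beta**n), rng.randrange(beta**n)
        diff = (A * B) // beta**n - short_product(A, B, n, beta)
        if diff < 0:
            return -1
        worst = max(worst, diff)
    return worst

rng = random.Random(1)
for n in (1, 2, 3, 8, 13):
    print(n, max_error(10, n, 200, rng), "bound:", 3 * (n - 1))
```

In such experiments the observed error is typically well below $3(n - 1)$, which is consistent with the bound being a worst-case estimate.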

Question: is the upper bound $C' + 3(n - 1)$ attained? Can the theorem be
improved?

Remark: if one of the operands was truncated before applying algorithm
ShortProduct, simply add one unit to the upper bound (the truncated part
is less than 1, thus its product by the other operand is bounded by $\beta^n$).



3.3.1 Integer Multiplication via Complex FFT 

To multiply $n$-bit integers, it may be advantageous to use the Fast Fourier
Transform (FFT for short, see §1.3.4). Note that the FFT computes the cyclic
convolution $z = x * y$ defined by

$z_k = \sum_{0 \le j < N} x_j\, y_{(k-j) \bmod N}$ for $0 \le k < N$.

In order to use the FFT for integer multiplication, we have to pad the input
vectors with zeros, thus increasing the length of the transform from $N$ to
$2N$.

FFT algorithms fall into two classes: those using number theoretical prop- 
erties, and those based on complex floating-point computations. The latter, 
while not always giving the best known asymptotic complexity, have good 
practical behaviour, because they take advantage of the efficiency of floating- 
point hardware. The drawback of the complex floating-point FFT (complex 
FFT for short) is that, being based on floating-point computations, it re- 
quires a rigorous error analysis. However, in some contexts where occasional 
errors are not disastrous, one may accept a small probability of error if this 
speeds up the computation. For example, in the context of integer factorisa- 
tion, a small probability of error is acceptable because the result (a purported 
factorisation) can easily be checked and discarded if incorrect. 
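
As an illustration of the complex FFT approach, here is a minimal Python sketch of integer multiplication with a recursive radix-2 FFT on hardware complex floats. It is not an optimised implementation: the names fft, to_digits and fft_multiply, and the conservative choice of 8 bits per coefficient (with unsigned coefficients), are assumptions made for this example, and no rigorous error bound is checked in the code.

```python
import cmath
import random

def fft(a, inverse=False):
    """Recursive radix-2 FFT of a list whose length is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def to_digits(x, bits):
    """Little-endian digits of x >= 0 in base 2**bits."""
    digits = []
    while x:
        digits.append(x & ((1 << bits) - 1))
        x >>= bits
    return digits or [0]

def fft_multiply(x, y, bits=8):
    """Multiply nonnegative integers via a zero-padded complex FFT."""
    dx, dy = to_digits(x, bits), to_digits(y, bits)
    size = 1
    while size < len(dx) + len(dy):            # padding avoids wrap-around
        size *= 2
    fx = fft(dx + [0] * (size - len(dx)))
    fy = fft(dy + [0] * (size - len(dy)))
    fz = fft([u * v for u, v in zip(fx, fy)], inverse=True)
    coeffs = [round(c.real / size) for c in fz]   # round each z_k to an integer
    return sum(c << (bits * i) for i, c in enumerate(coeffs))

rng = random.Random(2)
a, b = rng.getrandbits(400), rng.getrandbits(400)
print(fft_multiply(a, b) == a * b)
```

The final rounding of each coefficient to the nearest integer is only valid while the accumulated floating-point error stays below 1/2, which is why the number of bits per coefficient must shrink as the transform size grows.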

The following theorem provides a tight error analysis of the complex FFT:

Theorem 3.3.2 The FFT allows computation of the cyclic convolution $z =
x * y$ of two vectors of length $N = 2^n$ of complex values such that

$\|z' - z\|_\infty \le \|x\| \cdot \|y\| \cdot ((1 + \varepsilon)^{3n}(1 + \varepsilon\sqrt{5})^{3n+1}(1 + \mu)^{3n} - 1)$,    (3.2)

where $\|\cdot\|$ and $\|\cdot\|_\infty$ denote the Euclidean and infinity norms respectively, $\varepsilon$
is such that $|(a \pm b)' - (a \pm b)| \le \varepsilon|a \pm b|$, $|(ab)' - (ab)| \le \varepsilon|ab|$ for all machine
floats $a$, $b$, $\mu \ge |(w^k)' - (w^k)|$, $0 \le k < N$, $w = e^{2\pi i/N}$, and $(\cdot)'$ refers to the
computed (stored) value of $\cdot$ for each expression.

For the IEEE 754 double precision format, with rounding to nearest, we
have $\varepsilon = 2^{-53}$, and if the $w^k$ are correctly rounded, we can take $\mu = \varepsilon/\sqrt{2}$.
For a fixed FFT size $N = 2^n$, Eq. (3.2) enables one to compute a bound
$B$ on the coefficients of $x$ and $y$, such that $\|z' - z\|_\infty < 1/2$, which enables
one to uniquely round the coefficients of $z'$ to an integer. Table 3.3 gives
$b = \lg B$, the number of bits that can be used in a 64-bit floating-point word,
if we wish to perform $m$-bit multiplication exactly (in fact $m = 2^{n-1} b$). It
is assumed that the FFT is performed with signed coefficients in the range
$[-2^{b-1}, +2^{b-1})$, see [jHSl pg. 161].

   n    b          m
   2   24         48
   3   23         92
   4   22        176
   5   22        352
   6   21        672
   7   20       1280
   8   20       2560
   9   19       4864
  10   19       9728
  11   18      18432
  12   17      34816
  13   17      69632
  14   16     131072
  15   16     262144
  16   15     491520
  17   15     983040
  18   14    1835008
  19   14    3670016
  20   13    6815744

Table 3.3: Maximal number $b$ of bits per IEEE 754 double-precision
floating-point number binary64 (53-bit significand), and maximal $m$ for a plain $m \times m$
bit integer product, for a given FFT size $2^n$, with signed coefficients.

Note that Theorem 3.3.2 is a worst-case result; with rounding to nearest
we expect the error to be smaller due to cancellation (see Exercise 3.9).



3.3.2 The Middle Product 

Given two integers of $2n$ and $n$ bits respectively, their "middle product"
consists of the $n$ middle bits of their product (see Fig. 3.3). The middle product
might be computed using two short products, one (low) short product
between the high part of $y$ and $x$, and one (high) short product between the low
part of $y$ and $x$. However there are algorithms to compute a $2n \times n$ middle
product with the same $M(n)$ complexity as an $n \times n$ full product (see §3.8).

Figure 3.3: The middle product of $y$ of $2n$ bits and $x$ of $n$ bits is the middle
region.

Several applications may benefit from an efficient middle product. One
of those applications is Newton's method (§4.2). Consider for example the
reciprocal iteration (§4.2.2) $x_{j+1} = x_j + x_j(1 - x_j y)$. If $x_j$ has $n$ bits, to get
$2n$ accurate bits in $x_{j+1}$, one has to consider $2n$ bits from $y$. The product $x_j y$
then has $3n$ bits, but if $x_j$ is accurate to $n$ bits, the $n$ most significant bits
of $x_j y$ cancel with 1, and the $n$ least significant bits can be ignored in this
iteration. Thus what is wanted is exactly the middle product of $x_j$ and $y$.


Payne and Hanek Argument Reduction 

Another application of the middle product is Payne and Hanek argument reduction. Assume $x = m \cdot 2^e$ is a floating-point number with a significand $1/2 \le m < 1$ of $n$ bits and a large exponent $e$ (say $n = 53$ and $e = 1024$ to fix the ideas). We want to compute $\sin x$ with a precision of $n$ bits. The classical argument reduction works as follows: first compute $k = \lfloor x/\pi \rceil$, then compute the reduced argument

$$x' = x - k\pi. \qquad (3.3)$$



About $e$ bits will be cancelled in the subtraction $x - (k\pi)$, thus we need to compute $k\pi$ with a precision of at least $e + n$ bits to get an accuracy of at least $n$ bits for $x'$. Assuming $1/\pi$ has been precomputed to precision $e$, the computation of $k$ costs $M(e, n)$, and the multiplication $k\pi$ costs $M(e + n)$, thus the total cost is about $M(e)$ when $e \gg n$. The key idea of Payne and Hanek's algorithm is to rewrite Eq. (3.3) as follows:

$$x' = \pi \left( \frac{x}{\pi} - k \right). \qquad (3.4)$$

Figure 3.4: A graphical view of Payne and Hanek algorithm.



If the significand of $x$ has $n \ll e$ bits, only about $2n$ bits from the expansion of $1/\pi$ will effectively contribute to the $n$ most significant bits of $x'$, namely the bits of weight $2^{-e-n}$ to $2^{-e+n}$. Let $y$ be the corresponding $2n$-bit part of $1/\pi$. Payne and Hanek's algorithm works as follows: first multiply the $n$-bit significand of $x$ by $y$, keep the $n$ middle bits, and multiply by an $n$-bit approximation of $\pi$. The total cost is $M(2n, n) + M(n)$, or even $2M(n)$ if the middle product is performed in time $M(n)$.
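The windowing idea can be sketched with Python integers. The sketch below is an illustration under several assumptions, not the authors' implementation: $x$ is written as $M \cdot 2^E$ with an $n$-bit integer $M$ and $E \ge 0$ (so $E = e - n$ in the notation above), a few guard bits are added to the $2n$-bit window, and $\pi$ is recomputed by Machin's formula only to keep the example self-contained (in practice $1/\pi$ would be precomputed). The names `pi_scaled` and `frac_x_over_pi` are hypothetical.

```python
def pi_scaled(bits: int) -> int:
    """Integer close to pi * 2**bits, from pi = 16*atan(1/5) - 4*atan(1/239)."""
    guard = 64
    one = 1 << (bits + guard)

    def atan_inv(q: int) -> int:          # atan(1/q) * one, alternating series
        term = one // q
        total, k, sign = term, 1, -1
        while term:
            term //= q * q
            k += 2
            total += sign * (term // k)
            sign = -sign
        return total

    return (16 * atan_inv(5) - 4 * atan_inv(239)) >> guard


def frac_x_over_pi(M: int, E: int, n: int, guard: int = 16) -> float:
    """Fractional part of (M * 2**E) / pi using a (2n + guard)-bit window of 1/pi."""
    W = 2 * n + guard
    P = E + W                                  # bits of 1/pi below the binary point
    inv_pi = (1 << (2 * P)) // pi_scaled(P)    # about 2**P / pi
    # Bits of weight >= 2**-E give integers once multiplied by M * 2**E,
    # so only the W bits of weight 2**-(E+1) .. 2**-(E+W) matter modulo 1.
    y = inv_pi & ((1 << W) - 1)
    return ((M * y) & ((1 << W) - 1)) / (1 << W)


# x = M * 2**971 is close to the largest double; only 122 bits of 1/pi are used.
f = frac_x_over_pi(0x1ABCDEF0123456, 971, 53)
print(f)   # x/pi modulo 1; the reduced argument is pi * f, up to a multiple of pi
```

Multiplying the returned fraction by an $n$-bit approximation of $\pi$ gives the reduced argument; the parity of $k$, needed for the sign of $\sin x$, would require keeping one more bit of the window.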

3.4 Reciprocal and Division 

As for integer operations (§1.4), one should try as much as possible to trade floating-point divisions for multiplications, since the cost of a floating-point multiplication is theoretically smaller than the cost for a division by a constant factor (usually 2 up to 5 depending on the algorithm used). In practice, the ratio might not even be constant, as some implementations provide division with cost $\Theta(M(n) \log n)$ or $\Theta(n^2)$.

When several divisions have to be performed with the same divisor, a well-known trick is to first compute the reciprocal of the divisor (§3.4.1); then each division reduces to a multiplication by the reciprocal. A small drawback is that each division incurs two rounding errors (one for the reciprocal and one for multiplication by the reciprocal) instead of one, so we can no longer guarantee a correctly rounded result. For example, in base ten with six digits, 3.0/3.0 might evaluate to 0.999999 = 3.0 × 0.333333.
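The six-digit example can be reproduced with Python's standard `decimal` module, used here only as a convenient stand-in for a radix-ten floating-point format with six significant digits:

```python
from decimal import Decimal, getcontext

getcontext().prec = 6                  # six significant decimal digits
recip = Decimal(1) / Decimal(3)        # first rounding: 0.333333
via_reciprocal = Decimal(3) * recip    # second operation: 0.999999
direct = Decimal(3) / Decimal(3)       # one correctly rounded division: 1
print(recip, via_reciprocal, direct)
```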

The cases of a single division, or several divisions with a varying divisor, are considered in §3.4.2.

3.4.1 Reciprocal 

We describe here algorithms that compute an approximate reciprocal of a positive floating-point number $a$, using integer-only operations (see Chapter 1). Those integer operations simulate floating-point computations, but all roundings are made explicit. The number $a$ is represented by an integer $A$ of $n$ words in radix $\beta$: $a = \beta^{-n} A$, and we assume $\beta^n/2 \le A$, such that $1/2 \le a < 1$. (This does not cover all cases for $\beta \ge 3$, however if $\beta^{n-1} \le A < \beta^n/2$, multiplying $A$ by some appropriate integer $k < \beta$ will reduce to that case, then it suffices to multiply by $k$ the reciprocal of $ka$.)

We first perform an error analysis of Newton's method (§4.2) assuming all computations are done with infinite precision, thus neglecting roundoff errors.

Lemma 3.4.1 Let $1/2 \le a < 1$, $\rho = 1/a$, $x > 0$, and $x' = x + x(1 - ax)$. Then:

$$0 \le \rho - x' \le \frac{x^2}{\theta^3} (\rho - x)^2,$$

for some $\theta \in [\min(x, \rho), \max(x, \rho)]$.

Proof. Newton's iteration is based on approximating the function by its tangent. Let $f(t) = a - 1/t$, with $\rho$ the root of $f$. The second-order expansion of $f$ at $t = \rho$ with explicit remainder is:

$$f(\rho) = f(x) + (\rho - x) f'(x) + \frac{(\rho - x)^2}{2} f''(\theta),$$

for some $\theta \in [\min(x, \rho), \max(x, \rho)]$. Since $f(\rho) = 0$, this simplifies to

$$\rho = x - \frac{f(x)}{f'(x)} - \frac{(\rho - x)^2}{2} \frac{f''(\theta)}{f'(x)}. \qquad (3.5)$$

Substituting $f(t) = a - 1/t$, $f'(t) = 1/t^2$ and $f''(t) = -2/t^3$, it follows that:

$$\rho = x + x(1 - ax) + \frac{x^2}{\theta^3} (\rho - x)^2,$$

which proves the claim. □
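A small exact check of the lemma, as a sketch: expanding $x' = x + x(1 - ax)$ gives the identity $\rho - x' = a(\rho - x)^2$, which is the lemma's bound with $x^2/\theta^3 = a$. The starting value 5/4 and the helper name are arbitrary choices for illustration.

```python
from fractions import Fraction


def newton_reciprocal_step(a: Fraction, x: Fraction) -> Fraction:
    """One infinite-precision Newton step for the reciprocal of a."""
    return x + x * (1 - a * x)


a = Fraction(3, 4)
rho = 1 / a
x = Fraction(5, 4)
errors = []
for _ in range(4):
    errors.append(rho - x)
    x = newton_reciprocal_step(a, x)
errors.append(rho - x)
print([float(err) for err in errors])   # each error is a times the previous one squared
```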

Algorithm ApproximateReciprocal computes an approximate reciprocal. The input $A$ is assumed to be normalised, i.e., $\beta^n/2 \le A < \beta^n$. The output integer $X$ is an approximation to $\beta^{2n}/A$.

Lemma 3.4.2 If $\beta$ is a power of two satisfying $\beta \ge 8$, the output $X$ of Algorithm ApproximateReciprocal satisfies:

$$AX < \beta^{2n} < A(X + 2).$$

Proof. For $n \le 2$ the algorithm returns $X = \lfloor \beta^{2n}/A \rfloor$, except when $A = \beta^n/2$ where it returns $X = 2\beta^n - 1$. In all cases we have $AX < \beta^{2n} \le A(X + 1)$, thus the lemma holds.

Now consider $n \ge 3$. We have $\ell = \lfloor (n-1)/2 \rfloor$ and $h = n - \ell$, thus $n = h + \ell$ and $h > \ell$. The algorithm first computes an approximate reciprocal of the upper $h$ words of $A$, and then updates it to $n$ words using Newton's iteration.

Algorithm 43 ApproximateReciprocal

Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$, with $0 \le a_i < \beta$ and $\beta/2 \le a_{n-1}$.
Output: $X = \beta^n + \sum_{i=0}^{n-1} x_i \beta^i$ with $0 \le x_i < \beta$.

1: if $n \le 2$ then return $\lceil \beta^{2n}/A \rceil - 1$
2: $\ell \leftarrow \lfloor (n-1)/2 \rfloor$, $h \leftarrow n - \ell$
3: $A_h \leftarrow \sum_{i=0}^{h-1} a_{\ell+i} \beta^i$
4: $X_h \leftarrow$ ApproximateReciprocal($A_h$)
5: $T \leftarrow A X_h$
6: while $T \ge \beta^{n+h}$ do
7:     $(X_h, T) \leftarrow (X_h - 1, T - A)$
8: $T \leftarrow \beta^{n+h} - T$
9: $T_m \leftarrow \lfloor T \beta^{-\ell} \rfloor$
10: $U \leftarrow T_m X_h$
11: return $X_h \beta^\ell + \lfloor U \beta^{\ell - 2h} \rfloor$.
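A direct transcription of the listing into Python may help when experimenting with it. This is a sketch, not a tuned implementation: it uses Python's built-in integers for every operation, passes $n$ and $\beta$ explicitly (the parameter names are assumptions), and makes no attempt to reach the complexity discussed in the text.

```python
def approximate_reciprocal(A: int, n: int, beta: int) -> int:
    """Return X with A*X < beta**(2n) < A*(X+2), for beta**n/2 <= A < beta**n."""
    if n <= 2:
        return -(-beta ** (2 * n) // A) - 1        # ceil(beta^(2n)/A) - 1
    l = (n - 1) // 2
    h = n - l
    A_h = A // beta ** l                           # upper h words of A
    X_h = approximate_reciprocal(A_h, h, beta)     # recursive call on h words
    T = A * X_h
    while T >= beta ** (n + h):                    # ensure A*X_h < beta^(n+h)
        X_h, T = X_h - 1, T - A
    T = beta ** (n + h) - T
    T_m = T // beta ** l                           # upper part of T
    U = T_m * X_h
    return X_h * beta ** l + U // beta ** (2 * h - l)


A, n, beta = 0x9E3779B97F4A7C15, 8, 2 ** 8         # an 8-word input in radix 256
X = approximate_reciprocal(A, n, beta)
print(A * X < beta ** (2 * n) < A * (X + 2))
```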



After the recursive call at line 4 we have by induction

$$A_h X_h < \beta^{2h} < A_h (X_h + 2). \qquad (3.6)$$

After the product $T \leftarrow A X_h$ and the while-loop at steps 6 and 7 we still have $T = A X_h$, where $T$ and $X_h$ may have new values, and in addition $T < \beta^{n+h}$. We also have $\beta^{n+h} < T + 2A$; we prove the latter by distinguishing two cases. Either we entered the while-loop, then since the value of $T$ decreases by $A$ at each loop, the previous value $T + A$ was necessarily greater than or equal to $\beta^{n+h}$. If we didn't enter the while-loop, the value of $T$ is the original one $T = A X_h$. Multiplying Eq. (3.6) by $\beta^\ell$ gives: $\beta^{n+h} < A_h \beta^\ell (X_h + 2) \le A(X_h + 2) = T + 2A$. We thus have:

$$T < \beta^{n+h} < T + 2A.$$

It follows $T > \beta^{n+h} - 2A > \beta^{n+h} - 2\beta^n$. As a consequence, the value of $\beta^{n+h} - T$ computed at step 8 cannot exceed $2\beta^n - 1$. The last lines compute the product $T_m X_h$, where $T_m$ is the upper part of $T$, and put its $\ell$ most significant words in the low part $X_\ell$ of the result $X$.

Now let us perform the error analysis. Compared to Lemma 3.4.1, $x$ stands for $X_h \beta^{-h}$, $a$ stands for $A \beta^{-n}$, and $x'$ stands for $X \beta^{-n}$. The while-loop ensures that we start from an approximation $x < 1/a$, i.e., $A X_h < \beta^{n+h}$. Then Lemma 3.4.1 guarantees that $x \le x' \le 1/a$ if $x'$ is computed with infinite precision. Here we have $x \le x'$, since $X = X_h \beta^\ell + X_\ell$, where $X_\ell \ge 0$. The only differences with infinite precision are:

• the low $\ell$ words from $1 - ax$ (here $T$ at line 8) are neglected, and only its upper part $(1 - ax)_h$ (here $T_m$) is considered;

• the low $2h - \ell$ words from $x(1 - ax)_h$ are neglected.

Those two approximations make the computed value of $x'$ smaller than or equal to the one which would be computed with infinite precision, thus we have for the computed value $x'$:

$$x \le x' \le 1/a.$$

The mathematical error is bounded from Lemma 3.4.1 by $\frac{x^2}{\theta^3}(\rho - x)^2 < 4\beta^{-2h}$ since $\frac{x^2}{\theta^3} \le 1$ and $|\rho - x| < 2\beta^{-h}$. The truncation from $1 - ax$, which is multiplied by $x < 2$, produces an error less than $2\beta^{-2h}$. Finally the truncation of $x(1 - ax)_h$ produces an error less than $\beta^{-n}$. The final result is thus:

$$x' \le \rho < x' + 6\beta^{-2h} + \beta^{-n}.$$

Assuming $6\beta^{-2h} \le \beta^{-n}$, which holds as soon as $\beta \ge 6$ since $2h > n$, this simplifies to:

$$x' \le \rho < x' + 2\beta^{-n},$$

which gives with $x' = X\beta^{-n}$ and $\rho = \beta^n/A$:

$$X \le \frac{\beta^{2n}}{A} < X + 2.$$

Since $\beta$ is assumed to be a power of two, equality can hold only when $A$ is a power of two itself, i.e., $A = \beta^n/2$. In that case there is only one value of $X_h$ that is possible for the recursive call, namely $X_h = 2\beta^h - 1$. In that case $T = \beta^{n+h} - \beta^n/2$ before the while-loop, which is not entered. Then $\beta^{n+h} - T = \beta^n/2$, which multiplied by $X_h$ gives (again) $\beta^{n+h} - \beta^n/2$, whose $h$ most significant words are $\beta - 1$. Thus $X_\ell = \beta^\ell - 1$, and $X = 2\beta^n - 1$: equality does not occur either in that case. □

Remark. The Lemma might be extended to the case $\beta^{n-1} \le A < \beta^n$ or to a radix $\beta$ which is not a power of two. However we prefer to state this restricted Lemma with simple bounds.

Complexity Analysis. Let $I(n)$ be the cost to invert an $n$-word number using Algorithm ApproximateReciprocal. If we neglect the linear costs, we have $I(n) \approx I(n/2) + M(n, n/2) + M(n/2)$, where $M(n, n/2)$ is the cost of an $n \times (n/2)$ product (the product $A X_h$ at step 5) and $M(n/2)$ the cost of an $(n/2) \times (n/2)$ product (the product $T_m X_h$ at step 10). If the $n \times (n/2)$ product is performed via two $(n/2) \times (n/2)$ products, we have $I(n) \approx I(n/2) + 3M(n/2)$, which yields $M(n)$ in the quadratic range, $\frac{3}{2}M(n)$ in the Karatsuba range, $\approx 1.704 M(n)$ in the Toom-Cook 3-way range, and $3M(n)$ in the FFT range. In the FFT range, an $n \times (n/2)$ product might be directly computed by an FFT of length $3n/2$ words, which therefore amounts to $M(3n/4)$; in that case the complexity decreases to $2.5M(n)$.
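The constants quoted above follow from summing a geometric series. As a sketch, assume $M(n) = n^\alpha$ (with $\alpha = 1$ standing in for the FFT range, where logarithmic factors are ignored); then $I(n) = I(n/2) + 3M(n/2)$ gives $I(n) = c\,M(n)$ with $c = 3 \cdot 2^{-\alpha}/(1 - 2^{-\alpha})$. The function name is hypothetical.

```python
from math import log


def reciprocal_constant(alpha: float) -> float:
    """c with I(n) = c*M(n), when I(n) = I(n/2) + 3*M(n/2) and M(n) = n**alpha."""
    ratio = 0.5 ** alpha                 # M(n/2) / M(n)
    return 3 * ratio / (1 - ratio)       # 3 * (ratio + ratio**2 + ...)


for name, alpha in [("quadratic", 2.0),
                    ("Karatsuba", log(3, 2)),
                    ("Toom-Cook 3-way", log(5, 3)),
                    ("FFT (alpha = 1)", 1.0)]:
    print(f"{name:16s} {reciprocal_constant(alpha):.3f}")
```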

The wrap-around trick. We now describe a slight modification of Algorithm ApproximateReciprocal which yields a complexity $2M(n)$. In the product $A X_h$ at step 5, Eq. (3.6) tells that the result approaches $\beta^{n+h}$, more precisely:

$$\beta^{n+h} - 2\beta^n < A X_h < \beta^{n+h} + 2\beta^n. \qquad (3.7)$$

Assume we use an FFT algorithm such as the Schönhage-Strassen algorithm that computes products modulo $\beta^m + 1$, for some integer $m \in (n, n+h)$. Let $A X_h = U\beta^m + V$ with $0 \le V < \beta^m$. It follows from Eq. (3.7) that $U = \beta^{n+h-m}$ or $U = \beta^{n+h-m} - 1$. Let $T = A X_h \bmod (\beta^m + 1)$ be the value computed by the FFT. We have $T = V - U$ or $T = V - U + (\beta^m + 1)$. It follows that $A X_h = T + U(\beta^m + 1)$ or $A X_h = T + (U - 1)(\beta^m + 1)$. Taking into account the two possible values of $U$, we have $A X_h = T + (\beta^{n+h-m} - \varepsilon)(\beta^m + 1)$ where $\varepsilon \in \{0, 1, 2\}$. Since $\beta \ge 6$, $\beta^m > 4\beta^n$, thus only one value of $\varepsilon$ yields a value of $A X_h$ in the interval $[\beta^{n+h} - 2\beta^n, \beta^{n+h} + 2\beta^n]$.

We thus replace step 5 in Algorithm ApproximateReciprocal by the following code:

Compute $T = A X_h \bmod (\beta^m + 1)$ using an FFT with $m > n$
$T \leftarrow T + \beta^{n+h} + \beta^{n+h-m}$    { case $\varepsilon = 0$ }
while $T \ge \beta^{n+h} + 2\beta^n$ do
    $T \leftarrow T - (\beta^m + 1)$
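The recovery step can be checked with ordinary integers. In the sketch below the modular product is obtained with Python's `%` operator, which merely stands in for an FFT-based multiplication modulo $\beta^m + 1$; the function name and the way the test value $X_h$ is built are assumptions made for illustration.

```python
def product_via_wraparound(A: int, X_h: int, n: int, h: int, m: int, beta: int) -> int:
    """Recover the exact product A*X_h from its residue modulo beta**m + 1."""
    modulus = beta ** m + 1
    T = (A * X_h) % modulus                       # what the FFT would deliver
    T += beta ** (n + h) + beta ** (n + h - m)    # candidate for epsilon = 0
    while T >= beta ** (n + h) + 2 * beta ** n:   # move down into the interval of Eq. (3.7)
        T -= modulus
    return T


beta, n = 2 ** 8, 6
l = (n - 1) // 2
h = n - l
m = n + 1                                         # any integer in (n, n + h)
A = 0xC0FFEE123456                                # beta**n / 2 <= A < beta**n
X_h = (beta ** (2 * h) - 1) // (A // beta ** l)   # satisfies Eq. (3.6)
print(product_via_wraparound(A, X_h, n, h, m, beta) == A * X_h)
```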

Assuming one can take $m$ close to $n$, the cost of the product $A X_h$ is only about that of an FFT of length $n$, that is $M(n/2)$.

3.4.2 Division

In this section we consider the case where the divisor changes between successive operations, so no precomputation involving the divisor can be performed. We first show that the number of consecutive zeros in the result is bounded by the divisor length, then we consider the division algorithm and its complexity. Lemma 3.4.3 analyses the case where the division operands are truncated, because they have a larger precision than desired in the result. Finally we discuss "short division" and the error analysis of Barrett's algorithm.

A floating-point division reduces to an integer division as follows. Assume dividend $a = \ell \cdot \beta^e$ and divisor $d = m \cdot \beta^f$, where $\ell, m$ are integers. Then $\frac{a}{d} = \frac{\ell}{m} \beta^{e-f}$. If $k$ bits of the quotient are needed, we first determine a scaling factor $g$ such that $\beta^{k-1} \le |\ell \beta^g / m| < \beta^k$, and we divide $\ell \beta^g$ (truncated if needed) by $m$. The following theorem gives a bound on the number of consecutive zeros after the integer part of the quotient of $\lfloor \ell \beta^g \rfloor$ by $m$.

Theorem 3.4.1 Assume we divide an $m$-digit positive integer by an $n$-digit positive integer in radix $\beta$, with $m \ge n$. Then the quotient is either exact, or its radix $\beta$ expansion admits at most $n - 1$ consecutive zeros or ones after the digit of weight $\beta^0$.

Proof. We first consider consecutive zeros. If the expansion of the quotient $q$ admits $n$ or more consecutive zeros after the radix point, we can write $q = q_1 + \beta^{-n} q_0$, where $q_1$ is an integer and $0 \le q_0 < 1$. If $q_0 = 0$, then the quotient is exact. Otherwise, if $a$ is the dividend and $d$ is the divisor, one should have $a = q_1 d + \beta^{-n} q_0 d$. However, $a$ and $q_1 d$ are integers, and $0 < \beta^{-n} q_0 d < 1$, so $\beta^{-n} q_0 d$ cannot be an integer, so we have a contradiction. For consecutive ones, the proof is similar: write $q = q_1 - \beta^{-n} q_0$, with $0 \le q_0 \le 1$. Since $d < \beta^n$, we still have $0 \le \beta^{-n} q_0 d < 1$. □

The following algorithm performs the division of two $n$-digit floating-point numbers. The key idea is to approximate the inverse of the divisor to half precision only, at the expense of additional steps. (At step 4 the notation $MP(q_0, d)$ denotes the middle product of $q_0$ and $d$, i.e., the $n/2$ middle digits of that product.)

Algorithm 44 Divide

Input: $n$-digit floating-point numbers $c$ and $d$, with $n$ even
Output: an approximation of $c/d$

1: Write $d = d_1 \beta^{n/2} + d_0$ with $0 \le d_1, d_0 < \beta^{n/2}$
2: $r \leftarrow$ ApproximateReciprocal($d_1$, $n/2$)
3: $q_0 \leftarrow cr$ truncated to $n/2$ digits
4: $e \leftarrow MP(q_0, d)$
5: $q \leftarrow q_0 - re$

At step 2, $r$ is an approximation to $1/d_1$, and thus to $1/d$, with precision $n/2$ digits. Therefore at step 3, $q_0$ approximates $c/d$ to about $n/2$ digits, and the upper $n/2$ digits of $q_0 d$ at step 4 agree with those of $c$. The value $e$ computed at step 4 thus equals $q_0 d - c$ to precision $n/2$. It follows that $re \approx e/d$ agrees with $q_0 - c/d$ to precision $n/2$; hence the correction term added in the last step.
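The effect of the correction step can be seen with exact rational arithmetic. The sketch below is a simplification made for illustration: operands are fractions in $[1/2, 1)$ in radix 2, the reciprocal is obtained by exact division and then truncated (instead of calling ApproximateReciprocal), and $e = q_0 d - c$ is computed exactly instead of through a middle product. With exact arithmetic, $q - c/d = (q_0 - c/d)(1 - rd)$, so two errors of order $2^{-n/2}$ multiply. The function name is hypothetical.

```python
from fractions import Fraction
from math import floor


def divide_sketch(c: Fraction, d: Fraction, n: int) -> Fraction:
    """Approximate c/d from a half-precision reciprocal plus one correction."""
    half = 1 << (n // 2)
    d1 = Fraction(floor(d * half), half)        # upper n/2 bits of d
    r = Fraction(floor(half / d1), half)        # 1/d1 truncated to n/2 fractional bits
    q0 = Fraction(floor(c * r * half), half)    # c*r truncated to n/2 fractional bits
    e = q0 * d - c                              # exact here; Divide uses MP(q0, d)
    return q0 - r * e                           # corrected quotient


n = 20
c = Fraction(0xB7E15, 1 << n)
d = Fraction(0xC90FD, 1 << n)
q = divide_sketch(c, d, n)
print(float(abs(q - c / d)) * 2 ** n)           # error measured in units of 2**-n
```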

In the FFT range, the cost of Algorithm Divide is $\frac{5}{2}M(n)$: step 2 costs $2M(n/2) \approx M(n)$ with the wrap-around trick, and steps 3 to 5 each cost $M(n/2)$, using a fast middle product algorithm for step 4. By way of
comparison, if we computed a full precision inverse as in Barrett's algorithm 
(see below), the cost would be ^M(n). 

In the Karatsuba range, Algorithm Divide costs $\frac{3}{2}M(n)$, and is useful provided the middle product of step 4 is performed with cost $M(n/2)$. In the quadratic range, Algorithm Divide costs $2M(n)$, and a classical division should be preferred.

When the requested precision for the output is smaller than that of the inputs of a division, one has to truncate the inputs, in order to avoid some unnecessarily expensive computation. Assume for example that one wants to divide two numbers of 10,000 bits, with a 10-bit quotient. To apply the following lemma, just replace $\mu$ by an appropriate value such that $A_1$ and $B_1$ have about $2n$ and $n$ digits respectively, where $n$ is the wanted number of digits for the quotient; for example one might have $\mu = \beta^k$ to truncate $k$ words.

Lemma 3.4.3 Let $A$ and $B$ be two positive integers, and $\mu \ge 2$ a positive integer. Let $Q = \lfloor A/B \rfloor$, $A_1 = \lfloor A/\mu \rfloor$, $B_1 = \lfloor B/\mu \rfloor$, $Q_1 = \lfloor A_1/B_1 \rfloor$. If $A/B \le 2B_1$, then

$$Q \le Q_1 \le Q + 2.$$

The condition $A/B \le 2B_1$ is quite natural: it says that the truncated divisor $B_1$ should have essentially at least as many digits as the wanted quotient.

Proof. Let $A_1 = Q_1 B_1 + R_1$. We have $A = A_1 \mu + A_0$, $B = B_1 \mu + B_0$, thus

$$\frac{A}{B} = \frac{A_1 \mu + A_0}{B_1 \mu + B_0} \le \frac{A_1 \mu + A_0}{B_1 \mu} = Q_1 + \frac{R_1 \mu + A_0}{B_1 \mu}.$$

Since $R_1 < B_1$ and $A_0 < \mu$, $R_1 \mu + A_0 < B_1 \mu$, thus $A/B < Q_1 + 1$. Taking the floor of each side proves, since $Q_1$ is an integer, that $Q \le Q_1$.

Now consider the second inequality. For given truncated parts $A_1$ and $B_1$, and thus given $Q_1$, the worst case is when $A$ is minimal, say $A = A_1 \mu$, and $B$ is maximal, say $B = B_1 \mu + (\mu - 1)$. In this case we have:

$$\left| \frac{A_1}{B_1} - \frac{A}{B} \right| = \left| \frac{A_1}{B_1} - \frac{A_1 \mu}{B_1 \mu + (\mu - 1)} \right| = \left| \frac{A_1 (\mu - 1)}{B_1 (B_1 \mu + \mu - 1)} \right|.$$

The numerator equals $A - A_1 \le A$, and the denominator equals $B_1 B$, thus the difference $A_1/B_1 - A/B$ is bounded by $A/(B_1 B) \le 2$, and so is the difference between $Q$ and $Q_1$. □
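The lemma is easy to confirm exhaustively on small operands. The following brute-force sketch enumerates every pair $(A, B)$ below a small limit that satisfies the hypothesis $A/B \le 2B_1$, for a few values of $\mu$; the function name and the limit are arbitrary.

```python
def check_truncated_quotient(limit: int = 60) -> bool:
    """Exhaustively test Q <= Q1 <= Q + 2 under the hypothesis A/B <= 2*B1."""
    for mu in (2, 3, 10):
        for B in range(mu, limit):                 # B >= mu ensures B1 >= 1
            B1 = B // mu
            for A in range(1, 2 * B1 * B + 1):     # exactly the range A/B <= 2*B1
                Q = A // B
                Q1 = (A // mu) // B1
                if not (Q <= Q1 <= Q + 2):
                    return False
    return True


print(check_truncated_quotient())
```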

The following algorithm is useful in the Karatsuba and Toom-Cook range. The key idea is that, when dividing a $2n$-digit number by an $n$-digit number, some work that is necessary for a full $2n$-digit division can be avoided (see Fig. 3.5).



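Algorithm 45 ShortDivision

Input: $0 \le A < \beta^{2n}$, $\beta^n/2 \le B < \beta^n$
Output: an approximation of $A/B$

if $n \le n_0$ then return $\lfloor A/B \rfloor$
choose $k \ge n/2$, $\ell \leftarrow n - k$
$(A_1, A_0) \leftarrow (A \text{ div } \beta^{2\ell}, A \bmod \beta^{2\ell})$
$(B_1, B_0) \leftarrow (B \text{ div } \beta^{\ell}, B \bmod \beta^{\ell})$
$(Q_1, R_1) \leftarrow$ DivRem($A_1$, $B_1$)
$A' \leftarrow R_1 \beta^{2\ell} + A_0 - Q_1 B_0 \beta^{\ell}$
$Q_0 \leftarrow$ ShortDivision($A' \text{ div } \beta^k$, $B \text{ div } \beta^k$)
return $Q_1 \beta^{\ell} + Q_0$.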



Theorem 3.4.2 The approximate quotient $Q'$ returned by ShortDivision differs at most by $2 \lg n$ from the exact quotient $Q = \lfloor A/B \rfloor$, more precisely:

$$Q \le Q' \le Q + 2 \lg n.$$

Proof. If $n \le n_0$, $Q = Q'$ so the statement holds. Assume $n > n_0$. We have $A = A_1 \beta^{2\ell} + A_0$ and $B = B_1 \beta^{\ell} + B_0$, thus since $A_1 = Q_1 B_1 + R_1$, $A = (Q_1 B_1 + R_1)\beta^{2\ell} + A_0 = Q_1 B \beta^{\ell} + A'$, with $A' < \beta^{n+\ell}$. Let $A' = A'_1 \beta^k + A'_0$ and $B = B'_1 \beta^k + B'_0$, with $0 \le A'_0, B'_0 < \beta^k$, and $A'_1 < \beta^{2\ell}$. From Lemma 3.4.3 the exact quotient of $A' \text{ div } \beta^k$ by $B \text{ div } \beta^k$ is greater than or equal to that of $A'$ by $B$, thus by induction $Q_0 \ge A'/B$. Since $A/B = Q_1 \beta^{\ell} + A'/B$, this proves that $Q' \ge Q$.

Now by induction $Q_0 \le \frac{A'_1}{B'_1} + 2 \lg \ell$, and $\frac{A'_1}{B'_1} \le A'/B + 2$ (from Lemma 3.4.3 again, whose hypothesis $A'/B \le 2B'_1$ is satisfied, since $A' < B_1 \beta^{2\ell}$, thus $A'/B \le \beta^{\ell} \le 2B'_1$), so $Q_0 \le A'/B + 2 \lg n$, and $Q' \le A/B + 2 \lg n$. □

Figure 3.5: Divide and conquer short division: a graphical view. Left: with plain multiplication; right: with short multiplication. See also Fig. 1.1.



Barrett's division algorithm 



Here we consider division using Barrett's algorithm (§2.3.1) and provide a rigorous error bound. This algorithm is useful when the same divisor is used several times; otherwise Algorithm Divide is faster (see Exercise 3.12). Assume we want to divide $a$ by $b$ of $n$ bits, with a quotient of $n$ bits. Barrett's algorithm is as follows:

1. Compute the reciprocal $r$ of $b$ to $n$ bits [rounding to nearest]
2. $q \leftarrow \circ_n(a \times r)$ [rounding to nearest]

The cost of the algorithm in the FFT range is $3M(n)$: $2M(n)$ to compute the reciprocal with the wrap-around trick, and $M(n)$ for the product $a \times r$.

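Lemma 3.4.4 At step 2 of Barrett's algorithm, we have $|a - bq| \le \frac{3}{2}|b|$.

Proof. By scaling $a$ and $b$, we can assume that $b$ and $q$ are integers, $2^{n-1} \le b, q < 2^n$, thus $a < 2^{2n}$. We have $r = \frac{1}{b} + \varepsilon$ with $|\varepsilon| \le \frac{1}{2}\mathrm{ulp}(2^{-n}) = 2^{-2n}$. Also $q = ar + \varepsilon'$ with $|\varepsilon'| \le \frac{1}{2}\mathrm{ulp}(q) = \frac{1}{2}$ since $q$ has $n$ bits. Thus $q = a(\frac{1}{b} + \varepsilon) + \varepsilon' = \frac{a}{b} + a\varepsilon + \varepsilon'$, and $|bq - a| = |b||a\varepsilon + \varepsilon'| \le \frac{3}{2}|b|$. □

As a consequence, $q$ differs by at most one unit in last place from the $n$-bit quotient of $a$ and $b$, rounded to nearest.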
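The bound of Lemma 3.4.4 can be exercised with integers, following the scaling used in the proof. In this sketch (the function name and the integer encoding are assumptions) the reciprocal is stored as $r = R \cdot 2^{-(2n-1)}$ with $R$ rounded to nearest, and rounding $a \times r$ to the nearest integer plays the role of rounding to $n$ bits, since the quotient is scaled to have $n$ bits.

```python
import random


def barrett_quotient(a: int, b: int, n: int) -> int:
    """q = round(a * r), where r is 1/b rounded to nearest with n significant bits."""
    R = (2 ** (2 * n) + b) // (2 * b)        # round(2**(2n-1) / b), so r = R * 2**-(2n-1)
    shift = 2 * n - 1
    return (a * R + (1 << (shift - 1))) >> shift   # round(a * R / 2**(2n-1))


random.seed(4)
n = 24
worst = 0.0
for _ in range(1000):
    b = random.randrange(2 ** (n - 1), 2 ** n)
    a = random.randrange(b * 2 ** (n - 1), b * 2 ** n)   # so that a/b has n bits
    q = barrett_quotient(a, b, n)
    worst = max(worst, abs(a - b * q) / b)
print(worst)    # stays below 3/2, as the lemma states
```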

Lemma 3.4.4 can be applied as follows: to perform several divisions with a precision of $n$ bits with the same divisor, precompute a reciprocal with $n + g$ bits, and use the above algorithm with a working precision of $n + g$ bits. If the last $g$ bits of $q$ are neither 000...00x nor 111...11x (where x stands for 0 or 1), then rounding $q$ down to $n$ bits will yield $\circ_n(a/b)$ for a directed rounding mode.



3.5 Square Root 

Algorithm FPSqrt computes a floating-point square root, using as subroutine Algorithm SqrtRem to determine an integer square root (with remainder). It assumes an integer significand $m$, and a directed rounding mode (see Exercise 3.13 for rounding to nearest).

Theorem 3.5.1 Algorithm FPSqrt returns the square root of x, correctly- 
rounded. 

Proof. Since $m_1$ has $2n$ or $2n - 1$ bits, $s$ has exactly $n$ bits, and we have $x \ge s^2 2^{2k+f}$, thus $\sqrt{x} \ge s 2^{k+f/2}$. On the other hand, SqrtRem ensures that $r \le 2s$, thus $x 2^{-f} = (s^2 + r) 2^{2k} + m_0 < (s^2 + r + 1) 2^{2k} \le (s+1)^2 2^{2k}$. Since $y := s \cdot 2^{k+f/2}$ and $y^+ = (s + 1) \cdot 2^{k+f/2}$ are two consecutive $n$-bit floating-point numbers, this concludes the proof. □

Algorithm 46 FPSqrt

Input: $x = m \cdot 2^e$, a target precision $n$, a rounding mode $\circ$
Output: $y = \circ_n(\sqrt{x})$

If $e$ is odd, $(m', f) \leftarrow (2m, e - 1)$, else $(m', f) \leftarrow (m, e)$
If $m'$ has less than $2n - 1$ bits, then $(m', f) \leftarrow (m' 2^{2\ell}, f - 2\ell)$
Write $m' := m_1 2^{2k} + m_0$, $m_1$ having $2n$ or $2n - 1$ bits, $0 \le m_0 < 2^{2k}$
$(s, r) \leftarrow$ SqrtRem($m_1$)
If round to zero or down or $r = m_0 = 0$, return $s \cdot 2^{k+f/2}$
else return $(s + 1) \cdot 2^{k+f/2}$.

Note: in the case $s = 2^n - 1$, $s + 1 = 2^n$ is still representable with $n$ bits, and $y^+$ is in the upper binade.
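A possible transcription of FPSqrt for directed rounding, as a sketch: Python's `math.isqrt` plays the role of SqrtRem, the result is returned as a pair (significand, exponent), and the boolean `round_up` (a hypothetical parameter) selects between rounding down and rounding up. The padding amount $\ell$ is chosen as the smallest value giving $m'$ at least $2n - 1$ bits, which is one admissible choice.

```python
from math import isqrt


def fp_sqrt(m: int, e: int, n: int, round_up: bool = False):
    """Square root of x = m * 2**e (m > 0) to n bits; returns (s, E) meaning s * 2**E."""
    if e % 2:                              # make the exponent even
        m, f = 2 * m, e - 1
    else:
        f = e
    bits = m.bit_length()
    if bits < 2 * n - 1:                   # pad so that m has 2n-1 or 2n bits
        l = (2 * n - bits) // 2
        m, f = m << (2 * l), f - 2 * l
        bits = m.bit_length()
    k = (bits - 2 * n + 1) // 2            # m1 = m >> 2k has 2n or 2n-1 bits
    m1, m0 = m >> (2 * k), m & ((1 << (2 * k)) - 1)
    s = isqrt(m1)                          # integer square root ...
    r = m1 - s * s                         # ... and its remainder
    E = k + f // 2
    if not round_up or (r == 0 and m0 == 0):
        return s, E
    return s + 1, E


print(fp_sqrt(1, 1, 10))                   # sqrt(2) rounded down to 10 bits
print(fp_sqrt(1, 1, 10, round_up=True))    # sqrt(2) rounded up to 10 bits
```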

A different method is to use a subroutine computing an approximation to a reciprocal square root (§3.5.1), as follows:

Algorithm 47 FPSqrt2

Input: an $n$-bit floating-point number $x$
Output: an $n$-bit approximation $y$ of $\sqrt{x}$

1: $r \leftarrow$ ApproximateRecSquareRoot($x$)
2: $t \leftarrow \circ_{n/2}(hr)$
3: $u \leftarrow x - t^2$
4: return $t + \frac{r}{2} u$

Step 2 costs $M(n/2)$. Since the $n/2$ most significant bits of $t^2$ are known to match those of $x$ in step 3, we can perform a transform mod $x^{n/2} - 1$ in the FFT range, hence step 3 costs $M(n/4)$. Finally step 4 costs $M(n/2)$. In the FFT range, with a cost of $\frac{7}{2}M(n/2)$ for ApproximateRecSquareRoot($x$) (§3.5.1), the total cost is $3M(n)$. (See §3.8 for faster algorithms.)

3.5.1 Reciprocal Square Root 

In this section we describe an algorithm to compute the reciprocal square root $a^{-1/2}$ of a floating-point number $a$, with a rigorous error bound.

Lemma 3.5.1 Let $a, x > 0$, $\rho = a^{-1/2}$, and $x' = x + \frac{x}{2}(1 - ax^2)$. Then

$$0 \le \rho - x' \le \frac{3x^3}{2\theta^4} (\rho - x)^2,$$

for some $\theta \in [\min(x, \rho), \max(x, \rho)]$.

Proof. The proof is very similar to that of Lemma 3.4.1. Here we use $f(t) = a - 1/t^2$, with $\rho$ the root of $f$. Eq. (3.5) translates to:

$$\rho = x + \frac{x}{2}(1 - ax^2) + \frac{3x^3}{2\theta^4} (\rho - x)^2,$$

which proves the Lemma. □
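The one-sided, quadratic convergence stated by Lemma 3.5.1 can be observed with exact rationals. In this sketch the values $a = 2$ and $x = 7/10$ are arbitrary, and the condition $x' \le \rho$ is tested in the equivalent form $a x'^2 \le 1$ so that no irrational number is needed.

```python
from fractions import Fraction


def rec_sqrt_step(a: Fraction, x: Fraction) -> Fraction:
    """One infinite-precision Newton step for the reciprocal square root of a."""
    return x + x * (1 - a * x * x) / 2


a = Fraction(2)
x = Fraction(7, 10)
iterates = [x]
for _ in range(4):
    x = rec_sqrt_step(a, x)
    iterates.append(x)

for it in iterates:
    # 1 - a*x^2 is non-negative exactly when x <= a^(-1/2)
    print(float(it), float(1 - a * it * it))
```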



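Algorithm 48 ApproximateRecSquareRoot

Input: integer $A$ with $\beta^n \le A < 4\beta^n$
Output: integer $X$, $\beta^n/2 \le X < \beta^n$

1: if $n \le 2$ then return $\min(\beta^n - 1, \lfloor \beta^n / \sqrt{A\beta^{-n}} \rfloor)$
2: $\ell \leftarrow \lfloor (n-1)/2 \rfloor$, $h \leftarrow n - \ell$
3: $A_h \leftarrow \lfloor A\beta^{-\ell} \rfloor$
4: $X_h \leftarrow$ ApproximateRecSquareRoot($A_h$)
5: $T \leftarrow A X_h^2$
6: $T_h \leftarrow \lfloor T\beta^{-n} \rfloor$
7: $T_\ell \leftarrow \beta^{2h} - T_h$
8: $U \leftarrow T_\ell X_h$
9: return $\min(\beta^n - 1, X_h \beta^\ell + \lfloor U\beta^{\ell - 2h}/2 \rceil)$.

Note: even if $A_h X_h^2 \le \beta^{3h}$ at line 4, we might have $A X_h^2 > \beta^{n+2h}$ at line 5, which might cause $T_\ell$ to be negative.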

Lemma 3.5.2 As long as $\beta \ge 38$, if $X$ is the value returned by Algorithm ApproximateRecSquareRoot, $a = A\beta^{-n}$, and $x = X\beta^{-n}$, then $1/2 \le x < 1$ and

$$|x - a^{-1/2}| \le 2\beta^{-n}.$$

Proof. We have $1 \le a < 4$. Since $X$ is bounded by $\beta^n - 1$ at lines 1 and 9, we have $x, x_h < 1$, with $x_h = X_h \beta^{-h}$. We prove the statement by induction. It is true for $n \le 2$. Now assume the value $X_h$ at step 4 satisfies:

$$|x_h - a_h^{-1/2}| \le 2\beta^{-h},$$

where $a_h = A_h \beta^{-h}$. We have three sources of error, that we will bound in this order:

1. the rounding errors in lines 6 and 9;

2. the mathematical error given by Lemma 3.5.1, which would occur even if all computations were exact;

3. the error coming from the fact we use $A_h$ instead of $A$ in the recursive call at step 4.

At step 5 we have exactly:

$$t := T\beta^{-n-2h} = a x_h^2,$$

which gives $|t_h - a x_h^2| < \beta^{-2h}$ with $t_h := T_h \beta^{-2h}$, and in turn $|t_\ell - (1 - a x_h^2)| < \beta^{-2h}$ with $t_\ell := T_\ell \beta^{-2h}$. At step 8 it follows $|u - x_h(1 - a x_h^2)| < \beta^{-2h}$, where $u = U\beta^{-3h}$. Thus finally $|x - [x_h + \frac{x_h}{2}(1 - a x_h^2)]| < \frac{1}{2}\beta^{-2h} + \frac{1}{2}\beta^{-n}$, taking into account the rounding error in the last step.

Now we apply Lemma 3.5.1 to $x \to x_h$, $x' \to x$, to bound the mathematical error, assuming no rounding error occurs:

$$0 \le a^{-1/2} - x \le \frac{3x_h^3}{2\theta^4} (a^{-1/2} - x_h)^2,$$

which gives (see footnote 1) $|a^{-1/2} - x| \le 3.04 (a^{-1/2} - x_h)^2$. Now $|a^{-1/2} - a_h^{-1/2}| \le \frac{1}{2}|a - a_h| \nu^{-3/2}$ for $\nu \in [\min(a_h, a), \max(a_h, a)]$, thus $|a^{-1/2} - a_h^{-1/2}| \le \beta^{-h}/2$. Together with the induction hypothesis $|x_h - a_h^{-1/2}| \le 2\beta^{-h}$, it follows that $|a^{-1/2} - x_h| \le \frac{5}{2}\beta^{-h}$. Thus $|a^{-1/2} - x| \le 19\beta^{-2h}$.

The total error is thus bounded by:

$$|a^{-1/2} - x| \le \frac{3}{2}\beta^{-n} + 19\beta^{-2h}.$$

Since $2h \ge n + 1$, we see that $19\beta^{-2h} \le \frac{1}{2}\beta^{-n}$ for $\beta \ge 38$, and the proof follows. □

Let $R(n)$ be the cost of ApproximateRecSquareRoot for an $n$-bit input. We have $h, \ell \approx n/2$, thus the recursive call costs $R(n/2)$, step 5 costs $M(n/2)$ to compute $X_h^2$, and $M(n)$ for the product $A(X_h^2)$ (or $M(3n/4)$ in the FFT range using the wrap-around trick described in §3.4.1, since we know the upper $n/2$ bits of the product gives 1), and again $M(n/2)$ for step 8. We get $R(n) = R(n/2) + 2M(n)$ (or $R(n/2) + \frac{7}{4}M(n)$ in the FFT range), which yields $R(n) = 4M(n)$ (or $\frac{7}{2}M(n)$ in the FFT range).

1 Since $\theta \in [x_h, a^{-1/2}]$ and $|x_h - a^{-1/2}| \le \frac{5}{2}\beta^{-h}$, we have $\theta \ge x_h - \frac{5}{2}\beta^{-h}$; thus $x_h/\theta \le 1 + 5\beta^{-h}/(2\theta) \le 1 + 5\beta^{-h}$ (remember $\theta \in [x_h, a^{-1/2}]$), and it follows that $\theta \ge 1/2$. For $\beta \ge 38$, since $h \ge 2$, we have $1 + 5\beta^{-h} \le 1.0035$, thus $\frac{3x_h^3}{2\theta^4} \le \frac{3}{2\theta} 1.0035^3 \le 3.04$.

Modern Computer Arithmetic, version 0.3 of June 10, 2009

The above algorithm is not the optimal one in the FFT range, especially when using an FFT algorithm with cheap point-wise products (like the complex FFT, see §3.3.1). Indeed, Algorithm ApproximateRecSquareRoot uses the following form of Newton's iteration:

$$x' = x + \frac{x}{2}(1 - ax^2).$$

It might be better to write:

$$x' = x + \frac{1}{2}(x - ax^3).$$

Indeed, the product $x^3$ might be computed with a single FFT transform of length $3n/2$, replacing the point-wise products $\hat{x}_i^2$ by $\hat{x}_i^3$, with a total cost of about $\frac{3}{4}M(n)$. Moreover, the same idea can be used for the full product $ax^3$ of $5n/2$ bits, but whose $n/2$ upper bits match those of $x$, thus with the wrap-around trick a transform of length $2n$ is enough, with a cost of $M(n)$ for the last iteration, and a total cost of $2M(n)$ for the reciprocal square root. With that result, Algorithm FPSqrt2 costs $2.25M(n)$ only.

3.6 Conversion 

Since most software tools work in radix 2 or $2^k$, and humans usually enter or read floating-point numbers in radix 10 or $10^k$, conversions are needed from one radix to the other one. Most applications perform only very few conversions, in comparison to other arithmetic operations, thus the efficiency of the conversions is rarely critical (see footnote 2). The main issue here is therefore more correctness than efficiency. Correctness of floating-point conversions is not an easy task, as can be seen from the past bugs in Microsoft Excel (see footnote 3).

The algorithms described in this section use as subroutine the integer-conversion algorithms from Chapter 1. As a consequence, their efficiency depends on the efficiency of the integer-conversion algorithms.



2 An important exception is the computation of billions of digits of constants like $\pi$, $\log 2$, where a quadratic conversion routine would be far too slow.

3 In Excel 2007, the product 850 × 77.1 prints as 100,000 instead of 65,535; this is really an output bug, since if one multiplies "100,000" by 2, one gets 131,070. An input bug occurred in Excel 3.0 to 7.0, where the input 1.40737488355328 gave 0.64.

3.6.1 Floating-Point Output

In this section we follow the convention of using small letters for parameters 
related to the internal radix b, and capitals for parameters related to the 
external radix B. Consider the problem of printing a floating-point number, 
represented internally in radix b (say b = 2) in an external radix B (say 
B = 10). We distinguish here two kinds of floating-point output: 

• fixed-format output, where the output precision is given by the user,
and we want the output value to be correctly rounded according to the 
given rounding mode. This is the usual method when values are to be 
used by humans, for example to fill a table of results. In that case the 
input and output precision may be very different: for example one may 
want to print 1000 digits of 2/3, which uses only one digit internally 
in radix 3. Conversely, one may want to print only a few digits of a 
number accurate to 1000 bits. 

• free-format output, where we want the output value, when read with 
correct rounding (usually to nearest), to give back the initial number. 
Here the minimal number of printed digits may depend on the input 
number. This kind of output is useful when storing data in a file, while 
guaranteeing that reading the data back will produce exactly the same 
internal numbers, or for exchanging data between different programs. 

In other words, if we denote by $x$ the number we want to print, and $X$ the printed value, the fixed-format output requires $|x - X| < \mathrm{ulp}(X)$, and the free-format output requires $|x - X| < \mathrm{ulp}(x)$ for directed rounding. Replace $< \mathrm{ulp}(\cdot)$ by $\le \frac{1}{2}\mathrm{ulp}(\cdot)$ for rounding to nearest.
Algorithm 49 PrintFixed

Input: $x = f \cdot b^{e-p}$ with $f, e, p$ integers, $b^{p-1} \le |f| < b^p$, external radix $B$ and precision $P$, rounding mode $\circ$
Output: $X = F \cdot B^{E-P}$ with $F, E$ integers, $B^{P-1} \le |F| < B^P$, such that $X = \circ(x)$ in radix $B$ and precision $P$

1: $\lambda \leftarrow \circ(\log b / \log B)$
2: $E \leftarrow 1 + \lfloor (e - 1)\lambda \rfloor$
3: $q \leftarrow \lceil P/\lambda \rceil$
4: $y \leftarrow \circ(x B^{P-E})$ with precision $q$
5: If one cannot round $y$ to an integer, increase $q$ and goto 4
6: $F \leftarrow$ Integer($y$, $\circ$).
7: If $|F| \ge B^P$ then $E \leftarrow E + 1$ and goto 4
8: return $F, E$.

Some comments on Algorithm PrintFixed:

• it assumes that we have precomputed values of $\lambda_B = \circ(\log b / \log B)$ for any possible external radix $B$ (the internal radix $b$ is assumed to be fixed for a given implementation). Assuming the input exponent $e$ is bounded, it is possible (see Exercise 3.15) to choose these values precisely enough that

$$E = 1 + \left\lfloor (e - 1) \frac{\log b}{\log B} \right\rfloor, \qquad (3.8)$$

thus the value of $\lambda$ at step 1 is simply read from a table.

• the difficult part is step 4, where one has to perform the exponentiation $B^{P-E}$ (remember all computations are done in the internal radix $b$) and multiply the result by $x$. Since we expect an integer of $q$ digits in step 6, there is no need to use a precision of more than $q$ digits in these computations, but a rigorous bound on the rounding errors is required, so as to be able to correctly round $y$.

• in step 5, "one can round $y$ to an integer" means that the interval containing all possible values of $x B^{P-E}$ (including the rounding errors while approaching $x B^{P-E}$, and the error while rounding to precision $q$) contains no rounding boundary (if $\circ$ is a directed rounding, it should contain no integer; if $\circ$ is rounding to nearest, it should contain no half-integer).

Theorem 3.6.1 Algorithm PrintFixed is correct. 

Proof. First assume that the algorithm finishes. Eq. (3.8) implies $B^{E-1} \le b^{e-1}$, thus $|x| B^{P-E} \ge B^{P-1}$, which implies that $|F| \ge B^{P-1}$ at step 6. Thus $B^{P-1} \le |F| < B^P$ at the end of the algorithm. Now, printing $x$ gives $F \cdot B^a$ iff printing $x B^k$ gives $F \cdot B^{a+k}$ for any integer $k$. Thus it suffices to check that printing $x B^{P-E}$ gives $F$, which is clear by construction.

The algorithm terminates because at step 4, $x B^{P-E}$, if not an integer, cannot be arbitrarily close to an integer. If $P - E \ge 0$, let $k$ be the number of digits of $B^{P-E}$ in radix $b$, then $x B^{P-E}$ can be represented exactly with $p + k$ digits. If $P - E < 0$, let $g = B^{E-P}$, of $k$ digits in radix $b$. Assume $f/g = n + \varepsilon$ with $n$ integer; then $f - gn = g\varepsilon$. If $\varepsilon$ is not zero, $g\varepsilon$ is a non-zero integer, thus $|\varepsilon| \ge 1/g \ge 2^{-k}$.

The case $|F| \ge B^P$ at step 7 can occur for two reasons: either $|x| B^{P-E} \ge B^P$, thus its rounding also satisfies this inequality; or $|x| B^{P-E} < B^P$, but its rounding equals $B^P$ (this can only occur for rounding away from zero or to nearest). In the former case we have $|x| B^{P-E} \ge B^{P-1}$ at the next pass in step 4, while in the latter case the rounded value $F$ equals $B^{P-1}$ and the algorithm terminates. □

Now consider free- for mat output. For a directed rounding mode we want 
| a; — X\ < ulp(x) knowing \x — X\ < ulp(X). Similarly for rounding to 
nearest, if we replace ulp by | ulp. 

It is easy to see that a sufficient condition is that $\mathrm{ulp}(X) \le \mathrm{ulp}(x)$, or
equivalently $B^{E-P} \le b^{e-p}$ in Algorithm PrintFixed (with P not fixed at
input, which explains the "free-format" name). To summarise, we have

$$b^{e-1} \le |x| < b^e, \qquad B^{E-1} \le |X| < B^E.$$

Since $|x| < b^e$, and X is the rounding of x, we must have $B^{E-1} \le b^e$. It
follows that $B^{E-P} \le b^e B^{1-P}$, and the above sufficient condition becomes:

$$P \ge 1 + p \frac{\log b}{\log B}.$$

For example, with b = 2 and B = 10, p = 53 gives $P \ge 17$, and p = 24 gives
$P \ge 9$. As a consequence, if a double-precision IEEE 754 binary floating-point
number is printed with at least 17 significant decimal digits, it can be
read back without any discrepancy, assuming input and output are performed 
with correct rounding to nearest (or directed rounding, with appropriately 
chosen directions). 
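
The bound above can be checked numerically. The following is a minimal sketch, not part of Algorithm PrintFixed: the helper names `min_output_digits` and `round_trips` are hypothetical, and the round-trip test assumes that the Python implementation converts between binary and decimal with correct rounding to nearest.

```python
import math

def min_output_digits(p, b=2, B=10):
    """Smallest integer P satisfying P >= 1 + p*log(b)/log(B)."""
    return math.ceil(1 + p * math.log(b) / math.log(B))

def round_trips(x, digits):
    """Print x with `digits` significant decimal digits, read it back, compare."""
    text = format(x, ".{}e".format(digits - 1))
    return float(text) == x

print(min_output_digits(53), min_output_digits(24))
x = 0.1 + 0.2
print(round_trips(x, 17), round_trips(x, 16))
```

With 17 digits the value is recovered, whereas 16 digits are not enough for this particular x.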

3.6.2 Floating-Point Input 

The problem of floating-point input is the following. Given a floating-point
number X with a significand of P digits in some radix B (say B = 10), a
precision p and a given rounding mode, we want to correctly round X to a
floating-point number x with p digits in the internal radix b (say b = 2).

At first glance, this problem looks very similar to the floating-point output
problem, and one might think it suffices to apply Algorithm PrintFixed,
simply exchanging (b, p, e, f) and (B, P, E, F). Unfortunately, this is not
the case. The difficulty is that, in Algorithm PrintFixed, all arithmetic
operations are performed in the internal radix b, and we do not have such
operations in radix B (see however Ex. I1.3UJ1 . 



3.7 Exercises 

Exercise 3.1 Determine exactly for which IEEE 754 double precision numbers
the trick described in §3.1.5 works, to get the next number away from zero.

Exercise 3.2 (Kidder, Boldo) Assume a binary representation. The "round- 
ing to odd" mode jHSl 11131 I167J is defined as follows: in case the exact value is 
not representable, it rounds to the unique adjacent number with an odd signif- 
icand. ("Von Neumann rounding" [23] omits the test for the exact value being 
representable or not, and rounds to odd in all nonzero cases.) Note that overflow 
never occurs during rounding to odd. Prove that if y = round(x, p + k, odd) and
z = round(y, p, nearest_even), and k > 1, then z = round(x, p, nearest_even), i.e.,
the double-rounding problem does not occur.
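
Before attempting the proof, the claim can be tested on small cases. The sketch below is an exhaustive check, not a proof; the function `round_sig` and its mode names are hypothetical, and it handles positive rationals only, using exact arithmetic with `fractions.Fraction`.

```python
from fractions import Fraction
import math

def round_sig(x, p, mode):
    """Round a positive Fraction x to p significant bits.

    mode is "odd" (round to odd) or "nearest" (round to nearest, ties to even).
    """
    # Find e with 2**(e-1) <= x < 2**e.
    e = x.numerator.bit_length() - x.denominator.bit_length()
    while x >= Fraction(2) ** e:
        e += 1
    while x < Fraction(2) ** (e - 1):
        e -= 1
    scale = Fraction(2) ** (p - e)
    m = x * scale                  # scaled value, in [2**(p-1), 2**p)
    f = math.floor(m)
    if m != f:                     # x is not representable with p bits
        if mode == "odd":
            if f % 2 == 0:
                f += 1             # pick the neighbour with odd significand
        else:
            r = m - f
            if r > Fraction(1, 2) or (r == Fraction(1, 2) and f % 2 == 1):
                f += 1
    return Fraction(f) / scale

def double_round(x, p, k, first_mode):
    """Round to p+k bits with first_mode, then to p bits to nearest even."""
    return round_sig(round_sig(x, p + k, first_mode), p, "nearest")

x = Fraction(17) + Fraction(1, 64)
print(round_sig(x, 4, "nearest"))          # direct rounding to 4 bits
print(double_round(x, 4, 2, "nearest"))    # two roundings to nearest
print(double_round(x, 4, 2, "odd"))        # rounding to odd first
```

For this x, two successive roundings to nearest give a different result from the direct rounding, while rounding to odd first does not.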



Exercise 3.3 Show that, if $\sqrt{a}$ is computed using Newton's iteration for
$a^{-1/2}$:

$$x' = x + \frac{x}{2}(1 - ax^2)$$

(see §3.5.1) and the identity $\sqrt{a} = a \times a^{-1/2}$ with rounding mode "round towards
zero", then it might never be possible to determine the correctly rounded value of
$\sqrt{a}$, regardless of the number of additional "guard" digits used in the computation.

Exercise 3.4 How does truncating the operands of a multiplication to n + g
digits (as suggested in §3.3) affect the accuracy of the result? Considering the
cases g = 1 and g > 1 separately, what could happen if the same strategy were 
used for subtraction? 

Exercise 3.5 Is the bound of Theorem 3.3.1 optimal?

Exercise 3.6 Adapt Mulders' short product algorithm [134] to floating-point
numbers. In case the first rounding fails, can you compute additional digits with- 
out starting again from scratch? 

Exercise 3.7 If a balanced ternary system is used, that is radix 3 with possible
digits {0, ±1}, then "round to nearest" is equivalent to truncation.

Exercise 3.8 (Percival) One computes the product of two complex floating-point
numbers $z_0 = a_0 + ib_0$ and $z_1 = a_1 + ib_1$ in the following way: $x_a = \circ(a_0 a_1)$,
$x_b = \circ(b_0 b_1)$, $y_a = \circ(a_0 b_1)$, $y_b = \circ(a_1 b_0)$, $z = \circ(x_a - x_b) + \circ(y_a + y_b) \cdot i$.
All computations being done in precision n, with rounding to nearest, compute an error
bound of the form $|z - z_0 z_1| \le c 2^{-n} |z_0 z_1|$. What is the best possible c?

Exercise 3.9 Show that, if $\mu = O(\varepsilon)$ and $n\varepsilon \ll 1$, the bound in Theorem 3.3.2
simplifies to

$$\|z' - z\|_\infty = O(|x| \cdot |y| \cdot n\varepsilon).$$

If the rounding errors cancel we expect the error in each component of z' to be
$O(|x| \cdot |y| \cdot n^{1/2} \varepsilon)$. The error $\|z' - z\|_\infty$ could be larger since it is a maximum of
$N = 2^n$ component errors. Using your favourite implementation of the FFT, compare
the worst-case error bound given by Theorem 3.3.2 with the error $\|z' - z\|_\infty$ that
occurs in practice.

Exercise 3.10 (Enge) Design an algorithm that correctly rounds the product of 
two complex floating-point numbers with 3 multiplications only. [Hint: assume all 
operands and the result have n-bit significand.] 

Exercise 3.11 Write a computer program to check the entries of Table EP1 are 
correct and optimal. 

Exercise 3.12 To perform k divisions with the same divisor, which of Algorithm
Divide and Barrett's algorithm is faster?

Exercise 3.13 Adapt Algorithm FPSqrt to the rounding to nearest mode. 

Exercise 3.14 Prove that for any n-bit floating-point numbers $(x, y) \ne (0, 0)$,
and if all computations are correctly rounded, with the same rounding mode, the
result of $x / \sqrt{x^2 + y^2}$ lies in [−1, 1], except in a special case. What is this special
case?

Exercise 3.15 Show that the computation of E in Algorithm PrintFixed, step 2,
is correct, i.e., $E = 1 + \lfloor (e-1) \log b / \log B \rfloor$, as long as there is no integer n such
that $\left| \frac{n}{e-1} \frac{\log B}{\log b} - 1 \right| < \epsilon$, where ε is the relative precision when computing λ:
$\lambda = \frac{\log b}{\log B}(1 + \theta)$ with $|\theta| \le \epsilon$. For a fixed range of exponents $-e_{\max} \le e \le e_{\max}$,
deduce a working precision ε. Application: for b = 2, and $e_{\max} = 2^{31}$, compute
the required precision for $3 \le B \le 36$.

Exercise 3.16 (Lefèvre) The IEEE 754 standard requires binary to decimal
conversions to be correctly rounded in the range $m \cdot 10^n$ for $|m| \le 10^{17} - 1$ and
$|n| \le 27$ in double precision. Find the hardest-to-print double precision number
in that range (with rounding to nearest for example). Write a C program that
outputs double precision numbers in that range, and compare it to the sprintf
C-language function of your system. Same question for a conversion from the
IEEE 754R binary64 format (significand of 53 bits, $2^{-1074} \le |x| < 2^{1024}$) to the
decimal64 format (significand of 16 decimal digits).

Exercise 3.17 Same question as the above, for the decimal to binary conversion,
and the atof C-language function.

3.8 Notes and References 

In her PhD J12BI Chapter V], Valerie Menissier-Morain discusses alternatives to 
the classical non-redundant representation considered here: continued fractions 
and redundant representations. She also considers in Chapter III the theory of 
computable reals, their representation by B-adic numbers, and the computation
of algebraic or transcendental functions. 

Nowadays most computers use radix two, but other choices (for example radix 
16) were popular in the past, before the widespread adoption of the IEEE 754 
standard. A discussion of the best choice of radix is given in [33].

The main reference for floating-point arithmetic is the IEEE 754 standard [I], 
which defines four binary formats: single precision, single extended (deprecated), 
double precision, and double extended. The IEEE 854 standard jSH] defines radix- 
independent arithmetic, and mainly decimal arithmetic. Both standards are re- 
placed by the revision of IEEE 754 (approved by the IEEE Standards Committee 
on June 12, 2008). 

The rule regarding the precision of a result given possibly differing precisions 
of operands was considered in J40U98J . 

Floating-point expansions were introduced by Priest [143]. They are mainly
useful for a small number of summands, mainly two or three, and when the main
operations are additions or subtractions. For a larger number of summands the 
combinatorial logic becomes complex, even for addition. Also, except for simple 
cases, it seems difficult to obtain correct rounding with expansions. 

Some good references on error analysis of floating-point algorithms are the 
books by Higham [HE] and Muller |135j . Older references include Wilkinson's 
classics UliaHIT]. 

Collins, Krandick and Lefèvre proposed algorithms for multiple-precision
floating-point addition [61, 117].

The problem of leading zero anticipation and detection in hardware is classical;
see [147] for a comparison of different methods.

The idea of having a "short product" together with correct rounding was studied
by Krandick and Johnson [112] in 1993, where they attribute the term "short
product" to Knuth. They considered both the schoolbook and the Karatsuba domains.
In 2000 Mulders [134] invented an improved "short product" algorithm
based on Karatsuba multiplication, and also a "short division" algorithm. The
problem of consecutive zeros or ones (also called runs of zeros or ones) has
been studied by several authors in the context of computer arithmetic: Iordache
and Matula [100] studied division (Theorem 3.4.1), square root, and reciprocal
square root. Muller and Lang [115] generalised their results to algebraic functions.

The Fast Fourier Transform (FFT) using complex floating-point numbers is
described in Knuth [110]. See also the asymptotically faster algorithm by Fürer
[80]. Many variations of the FFT are discussed in the books by Crandall [64, 65].
For further references, see the Notes and References section of Chapter 2.

Theorem 3.3.2 is from Percival [140]: previous rigorous error analyses of complex
FFT gave very pessimistic bounds. Note that the proof given in [140] is
incorrect, but we have a correct proof (see J3H] and Ex. 3.8).

The concept of "middle product" for power series is discussed in [SHj ■ Bostan, 
Lecerf and Schost have shown it can be seen as a special case of
"Tellegen's principle", and have generalised it to operations other than
multiplication [32]. The link between usual multiplication and the middle product using
trilinear forms was mentioned by Victor Pan in [138] for the multiplication of
two complex numbers: "The duality technique enables us to extend any successful
bilinear algorithms to two new ones for the new problems, sometimes quite
different from the original problem ..." David Harvey has shown how to efficiently
implement the middle product for integers [94]. A detailed and comprehensive
description of the Payne and Hanek argument reduction method can be found in 

D3S1- 

The 2M(n) reciprocal algorithm (with the wrap-around trick) of §3.4.1 is
due to Schönhage, Grotefeld and Vetter [152]. It can be improved, as noticed by
Bernstein (17j . If we keep the FFT-transform of x, we can save gM(n) (assuming 
the term-to-term products have negligible cost), which gives ^M{n). Bernstein 
also proposes a "messy" $\frac{3}{2}M(n)$ algorithm [17]. Schönhage's $\frac{3}{2}M(n)$ algorithm
is better [151]. The idea is to write Newton's iteration as $x' = 2x - ax^2$. If x is
accurate to n/2 bits, then $ax^2$ has (in theory) 2n bits, but we know the upper
n/2 bits cancel with x, and we are not interested in the low n bits. Thus we can
perform one modular FFT of size 3n/2, which amounts to cost M(3n/4). See also
[63] for the roundoff error analysis when using a floating-point multiplier.

Bernstein in (TJj obtains faster square root algorithms in the FFT domain, by 



130 Modern Computer Arithmetic, version 0.3 of June 10, 2009 

caching some Fourier transforms, more precisely he gets -g-M(n) for the square 
root, and |M(n) for the simultaneous computation of x 1 ' 2 and x~ 1 ' 2 . 

Classical floating-point conversion algorithms are due to Steele and White 
|159j . Gay [S3], and Clinger jBH]; most of these authors assume fixed precision. 
Mike Cowlishaw maintains an extensive bibliography of conversion to and from 
decimal arithmetic (see fl5.H|l . What we call "free-format" output is called
"idempotent conversion" by Kahan [104]; see also Knuth [110, exercise 4.4-18].

Algebraic Complexity Theory by Bürgisser, Clausen and Shokrollahi [H^ is an
excellent book on topics including lower bounds, fast multiplication of numbers 
and polynomials, and Strassen-like algorithms for matrix multiplication. 

There is a large literature on interval arithmetic, which is outside the scope of 
this chapter. A good entry point is the Interval Computations web page (see the 
Appendix). See also the recent book by Kulisch [114].

This chapter does not consider complex arithmetic, except where relevant for 
its use in the FFT. An algorithm for the complex (floating-point) square root, 
which allows correct rounding, is given in [74].



Chapter 4 

Newton's Method and Function 
Evaluation 



Here we consider various applications of Newton's method, which 
can be used to compute reciprocals, square roots, and more gen- 
erally algebraic and functional inverse functions. We then con- 
sider unrestricted algorithms for computing elementary and spe- 
cial functions. The algorithms of this chapter are presented at 
a higher level than in Chapter 3. A full and detailed analysis of
one special function might be the subject of an entire chapter! 



4.1 Introduction 

This chapter is concerned with algorithms for computing elementary and 
special functions, although the methods apply more generally. First we con- 
sider Newton's method, which is useful for computing inverse functions. For 
example, if we have an algorithm for computing y = ln x, then Newton's
method can be used to compute x = exp y (see §4.2.5). However, Newton's
method has many other applications. In fact we already mentioned Newton's
method in Chapters 1-3, but here we consider it in more detail.

After considering Newton's method, we go on to consider various methods
for computing elementary and special functions. These methods include
power series (§4.4), asymptotic expansions (§4.5), continued fractions
(§4.6), recurrence relations (§4.7), the arithmetic-geometric mean (§4.8),
binary splitting (§4.9), and contour integration (§4.10). The methods that we
consider are unrestricted in the sense that there is no restriction on the
attainable precision; in particular, it is not limited to the precision of IEEE
standard 32-bit or 64-bit floating-point arithmetic. Of course, this depends
on the availability of a suitable software package for performing floating-point
arithmetic on operands of arbitrary precision, as discussed in Chapter 3.

Unless stated explicitly, we do not consider rounding issues in this chapter; 
it is assumed that methods described in Chapter 3 are used. Also, to simplify
the exposition, we assume a binary radix (β = 2), although most of the
content could be extended to any radix. We recall that n denotes the relative
precision (in bits here) of the desired approximation; if the absolute computed
value is close to 1, then we want an approximation to within $2^{-n}$.



4.2 Newton's Method 

Newton's method is a major tool in arbitrary-precision arithmetic. We have 
already seen it or its p-adic counterpart, namely Hensel lifting, in previous 
chapters (see for example Algorithm ExactDivision in §1.4.5 or the iteration
(2.5) to compute a modular inverse in §2.4). Newton's method is also
useful in small precision: most modern processors only implement multiplication
in hardware, and division and square root are microcoded, using
Newton's method. See the algorithms to compute a floating-point reciprocal
or reciprocal square root in §3.4.1 and §3.5.1.

This section discusses Newton's method in more detail, in the context of
floating-point computations, for the computation of inverse roots (§4.2.1),
reciprocals (§4.2.2), reciprocal square roots (§4.2.3), formal power series
(§4.2.4), and functional inverses (§4.2.5). We also discuss higher order
Newton-like methods (§4.2.6).

Newton's Method via Linearisation 

Recall that a function f of a real variable is said to have a zero ζ if f(ζ) = 0.
If f is differentiable in a neighbourhood of ζ, and f'(ζ) ≠ 0, then ζ is said
to be a simple zero. Similarly for functions of several real (or complex)
variables. In the case of several variables, ζ is a simple zero if the Jacobian
matrix evaluated at ζ is nonsingular.

Newton's method for approximating a simple zero ζ of f is based on the
idea of making successive linear approximations to f(x) in a neighbourhood
of ζ. Suppose that $x_0$ is an initial approximation, and that f(x) has two
continuous derivatives in the region of interest. From Taylor's theorem¹,

$$f(\zeta) = f(x_0) + (\zeta - x_0) f'(x_0) + \frac{(\zeta - x_0)^2}{2} f''(\xi) \qquad (4.1)$$

for some point ξ in an interval including $\{\zeta, x_0\}$. Since f(ζ) = 0, we see that

$$x_1 = x_0 - f(x_0)/f'(x_0)$$

is an approximation to ζ, and

$$x_1 - \zeta = O(|x_0 - \zeta|^2).$$

Provided $x_0$ is sufficiently close to ζ, we will have

$$|x_1 - \zeta| \le |x_0 - \zeta|/2 < 1.$$

This motivates the definition of Newton's method as the iteration

$$x_{j+1} = x_j - \frac{f(x_j)}{f'(x_j)}, \quad j = 0, 1, \ldots \qquad (4.2)$$

Provided $|x_0 - \zeta|$ is sufficiently small, we expect $x_n$ to converge to ζ. The
order of convergence will be at least two, that is

$$|e_{n+1}| \le K |e_n|^2$$

for some constant K independent of n, where $e_n = x_n - \zeta$ is the error after
n iterations.

A more careful analysis shows that 

$$e_{n+1} = \frac{f''(\zeta)}{2f'(\zeta)} e_n^2 + O(e_n^3), \qquad (4.3)$$

provided $f \in C^3$ near ζ. Thus, the order of convergence is exactly two if
$f''(\zeta) \ne 0$ and $e_0$ is sufficiently small but nonzero. (Such an iteration is also
said to be quadratically convergent.)



¹ Here we use Taylor's theorem at $x_0$, since this yields a formula in terms of derivatives
at $x_0$, which is known, instead of at ζ, which is unknown. Sometimes (for example in the
derivation of (4.3)), it is preferable to use Taylor's theorem at the (unknown) zero ζ.

4.2.1 Newton's Method for Inverse Roots

Consider applying Newton's method to the function

$$f(x) = y - x^{-m},$$

where m is a positive integer constant, and (for the moment) y is a positive
constant. Since $f'(x) = m x^{-(m+1)}$, Newton's iteration simplifies to

$$x_{j+1} = x_j + x_j (1 - x_j^m y)/m. \qquad (4.4)$$

This iteration converges to $\zeta = y^{-1/m}$ provided the initial approximation $x_0$
is sufficiently close to ζ. It is perhaps surprising that (4.4) does not involve
divisions, except for a division by the integer constant m. In particular, we
can easily compute reciprocals (the case m = 1) and reciprocal square roots
(the case m = 2) by Newton's method. These cases are sufficiently important
that we discuss them separately in the following subsections.
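
As an illustration, iteration (4.4) can be sketched in a few lines. This is a minimal sketch in hardware double precision rather than arbitrary precision; the function name `inverse_root`, the starting values and the fixed iteration count are assumptions made for the example.

```python
def inverse_root(y, m, x0, iterations=8):
    """Approximate y**(-1/m) with iteration (4.4).

    The only division is by the small integer m; x0 must be close
    enough to the root for the iteration to converge.
    """
    x = x0
    for _ in range(iterations):
        x = x + x * (1 - x**m * y) / m
    return x

print(inverse_root(3.0, 1, 0.3))   # approximates 1/3
print(inverse_root(2.0, 2, 0.7))   # approximates 1/sqrt(2)
print(inverse_root(5.0, 3, 0.5))   # approximates 5**(-1/3)
```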

4.2.2 Newton's Method for Reciprocals 

Taking m = 1 in (4.4), we obtain the iteration

$$x_{j+1} = x_j + x_j (1 - x_j y) \qquad (4.5)$$

which we expect to converge to 1/y provided $x_0$ is a sufficiently good
approximation. To see what "sufficiently good" means, define

$$u_j = 1 - x_j y.$$

Note that $u_j \to 0$ if and only if $x_j \to 1/y$. Multiplying each side of (4.5) by
y, we get

$$1 - u_{j+1} = (1 - u_j)(1 + u_j),$$

which simplifies to

$$u_{j+1} = u_j^2. \qquad (4.6)$$

Thus

$$u_j = (u_0)^{2^j}. \qquad (4.7)$$

We see that the iteration converges if and only if $|u_0| < 1$, which (for real $x_0$
and y) is equivalent to the condition $x_0 y \in (0, 2)$. Second-order convergence
is reflected in the double exponential with exponent 2 on the right-hand side
of (4.7).

The iteration (4.5) is sometimes implemented in hardware to compute
reciprocals of floating-point numbers, see for example |!U7j . The sign and 
exponent of the floating-point number are easily handled, so we can assume
that $y \in [0.5, 1.0)$ (recall we assume a binary radix in this chapter). The
initial approximation $x_0$ is found by table lookup, where the table is indexed
by the first few bits of y. Since the order of convergence is two, the number
of correct bits approximately doubles at each iteration. Thus, we can predict
in advance how many iterations are required. Of course, this assumes that
the table is initialised correctly².
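
The relation (4.7) can be observed exactly with rational arithmetic. The sketch below is only an illustration: `reciprocal_residuals` is a hypothetical helper, and exact fractions replace the floating-point arithmetic (and the table lookup) of a real implementation.

```python
from fractions import Fraction

def reciprocal_residuals(y, x0, steps):
    """Apply iteration (4.5) and record the residuals u_j = 1 - x_j*y."""
    x, y = Fraction(x0), Fraction(y)
    residuals = [1 - x * y]
    for _ in range(steps):
        x = x + x * (1 - x * y)
        residuals.append(1 - x * y)
    return x, residuals

x, u = reciprocal_residuals(Fraction(3, 4), Fraction(5, 4), 4)
for j, uj in enumerate(u):
    # Each residual is the square of the previous one, so u_j = u_0**(2**j).
    print(j, uj == u[0] ** (2 ** j))
```

Since the residual is squared at each step, the number of correct bits doubles, which is what makes the number of iterations predictable in advance.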

Computational Issues 

At first glance, it seems better to replace Eq. (4.5) by

$$x_{j+1} = x_j (2 - x_j y), \qquad (4.8)$$

which looks simpler. However, although those two forms are mathematically
equivalent, they are not computationally equivalent. Indeed, in Eq. (4.5), if
$x_j$ approximates 1/y to within n/2 bits, then $1 - x_j y = O(2^{-n/2})$, and the
product of $x_j$ by $1 - x_j y$ might be computed with a precision of only n/2
bits. In the apparently simpler form (4.8), $2 - x_j y = 1 + O(2^{-n/2})$, thus the
product of $x_j$ by $2 - x_j y$ has to be performed with a full precision of n bits,
to get $x_{j+1}$ accurate to within n bits.

As a general rule, it is best to separate the terms of different order in Newton's
iteration, and not try to factor common expressions. For an exception,
see the discussion of Schönhage's $\frac{3}{2}M(n)$ reciprocal algorithm in §3.8.



4.2.3 Newton's Method for (Reciprocal) Square Roots 

Taking m = 2 in (4.4), we obtain the iteration

$$x_{j+1} = x_j + x_j (1 - x_j^2 y)/2, \qquad (4.9)$$



² In the case of the infamous Pentium fdiv bug [87, 136], a lookup table used for division
was initialised incorrectly, and the division was occasionally inaccurate. Although the
algorithm used in the Pentium did not involve Newton's method, the moral is the same:
tables must be initialised correctly.
which we expect to converge to $y^{-1/2}$ provided $x_0$ is a sufficiently good
approximation.

If we want to compute $y^{1/2}$, we can do this in one multiplication after
first computing $y^{-1/2}$, since

$$y^{1/2} = y \times y^{-1/2}.$$

This method does not involve any divisions (except by 2). In contrast, if we
apply Newton's method to the function $f(x) = x^2 - y$, we obtain Heron's³
iteration (see Algorithm SqrtInt in §1.5.1) for the square root of y:

$$x_{j+1} = \frac{1}{2}\left(x_j + \frac{y}{x_j}\right). \qquad (4.10)$$

This requires a division by $x_j$ at iteration j, so it is essentially different from
the iteration (4.9). Although both iterations have second-order convergence,
we expect (4.9) to be more efficient (however this depends on the relative
cost of division compared to multiplication).
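
The two approaches can be put side by side. This is a minimal sketch in double precision; the function names, starting values and iteration counts are illustrative assumptions, and no attempt is made to measure the relative cost of multiplication and division.

```python
import math

def sqrt_via_rsqrt(y, x0, iterations=6):
    """Square root via iteration (4.9) for y**(-1/2), then one multiplication by y."""
    x = x0
    for _ in range(iterations):
        x = x + x * (1 - x * x * y) / 2   # no division except by 2
    return y * x

def sqrt_heron(y, x0, iterations=6):
    """Square root via Heron's iteration (4.10); one division per step."""
    x = x0
    for _ in range(iterations):
        x = (x + y / x) / 2
    return x

print(sqrt_via_rsqrt(2.0, 0.7), sqrt_heron(2.0, 1.5), math.sqrt(2.0))
```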

4.2.4 Newton's Method for Formal Power Series 

This section is not required for function evaluation, however it gives a com- 
plementary point of view on Newton's method. 

Newton's method can be applied to find roots of functions defined by
formal power series as well as of functions of a real or complex variable. For
simplicity we consider formal power series of the form

$$A(z) = a_0 + a_1 z + a_2 z^2 + \cdots$$

where $a_i \in \mathbb{R}$ (or any field of characteristic zero) and ord(A) = 0, i.e., $a_0 \ne 0$.
For example, if we replace y in (4.5) by 1 − z, and take initial approximation
$x_0 = 1$, we obtain a quadratically-convergent iteration for the formal
power series

$$(1 - z)^{-1} = \sum_{n=0}^{\infty} z^n.$$

In the case of formal power series, "quadratically convergent" means that
$\mathrm{ord}(e_j) \to +\infty$ like $2^j$, where $e_j$ is the difference between the desired result



3 Heron of Alexandria, circa 10-75 AD. 
and the jth approximation. In our example, with the notation of §4.2.2,
$u_0 = 1 - x_0 y = z$, so $u_j = z^{2^j}$ and

$$x_j = \frac{1 - u_j}{1 - z} = \frac{1}{1 - z} + O\left(z^{2^j}\right).$$

Given a formal power series $A(z) = \sum_{j \ge 0} a_j z^j$, we can define the formal
derivative

$$A'(z) = \sum_{j > 0} j a_j z^{j-1} = a_1 + 2a_2 z + 3a_3 z^2 + \cdots,$$

and the integral

$$\sum_{j \ge 0} \frac{a_j}{j+1} z^{j+1},$$

but there is no useful analogue for multiple-precision integers $\sum_{j=0}^{n} a_j \beta^j$.
This means that some fast algorithms for operations on power series have no
analogue for operations on integers (see for example Exercise 4.1).
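
The power series version of iteration (4.5) can be sketched with truncated coefficient lists. The code below is an illustration under simplifying assumptions: the names `mul_trunc` and `series_reciprocal` are hypothetical, the constant coefficient of the input is assumed to be 1, and the products use the schoolbook method rather than a fast multiplication.

```python
from fractions import Fraction

def mul_trunc(a, b, n):
    """Product of the power series a and b (coefficient lists), modulo z**n."""
    c = [Fraction(0)] * n
    for i, ai in enumerate(a[:n]):
        for j, bj in enumerate(b[:n - i]):
            c[i + j] += ai * bj
    return c

def series_reciprocal(y, n):
    """Reciprocal of the series y modulo z**n, assuming y[0] == 1.

    Uses x <- x + x*(1 - x*y), doubling the number of correct
    coefficients at each step.
    """
    x = [Fraction(1)]
    prec = 1
    while prec < n:
        prec = min(2 * prec, n)
        u = [-c for c in mul_trunc(x, y, prec)]
        u[0] += 1                                  # u = 1 - x*y mod z**prec
        corr = mul_trunc(x, u, prec)               # x*u mod z**prec
        x = [(x[i] if i < len(x) else 0) + corr[i] for i in range(prec)]
    return x

print(series_reciprocal([1, -1], 8))   # coefficients of 1/(1 - z)
```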

4.2.5 Newton's Method for Functional Inverses 

Given a function g(x), its functional inverse h(x) satisfies g(h(x)) = x, and
is denoted by $h(x) := g^{(-1)}(x)$. For example, g(x) = ln x and h(x) = exp x
are functional inverses, as are g(x) = tan x and h(x) = arctan x. Using the
function f(x) = y − g(x) in (4.2), one gets a root ζ of f, i.e., a value such
that g(ζ) = y, or $\zeta = g^{(-1)}(y)$:

$$x_{j+1} = x_j + \frac{y - g(x_j)}{g'(x_j)}.$$

Since this iteration only involves g and g', it provides an efficient way to
evaluate h(y), assuming that $g(x_j)$ and $g'(x_j)$ can be efficiently computed.
Moreover, if the complexity of evaluating g' is less than or equal to that of g,
we get a means to evaluate the functional inverse h of g with the same order
of complexity as that of g.

As an example, if one has an efficient implementation of the logarithm,
a similarly efficient implementation of the exponential is deduced as follows.
Consider the root $e^y$ of the function f(x) = y − ln x, which yields the iteration:

$$x_{j+1} = x_j + x_j (y - \ln x_j), \qquad (4.11)$$

and in turn the following algorithm (for the sake of simplicity, we consider here
only one Newton iteration):

Algorithm 50 LiftExp

Input: $x_j$, (n/2)-bit approximation to exp(y).
Output: $x_{j+1}$, n-bit approximation to exp(y).

t ← ln $x_j$              ▷ computed to n-bit accuracy
u ← y − t              ▷ computed to (n/2)-bit accuracy
v ← $x_j$ u               ▷ computed to (n/2)-bit accuracy
$x_{j+1}$ ← $x_j$ + v
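
The lifting step can be tried with Python's `decimal` module, which provides a logarithm at user-chosen precision. This is a simplified sketch, not Algorithm LiftExp itself: it works in decimal digits instead of bits, computes t, u and v all at the same precision (so it does not exploit the cheaper (n/2)-bit operations), and the function names and the 5 guard digits are arbitrary choices made for the example.

```python
import math
from decimal import Decimal, localcontext

def lift_exp(x, y, digits):
    """One Newton step x + x*(y - ln x), carried out with about `digits` digits."""
    with localcontext() as ctx:
        ctx.prec = digits + 5      # a few guard digits (arbitrary choice)
        t = x.ln()
        u = y - t
        v = x * u
        return x + v

def exp_newton(y, digits):
    """Approximate exp(y) by repeated lifting from a double-precision start."""
    y = Decimal(y)
    x = Decimal(repr(math.exp(float(y))))   # roughly 15 correct digits
    prec = 15
    while prec < digits:
        prec = min(2 * prec, digits)        # accuracy doubles at each step
        x = lift_exp(x, y, prec)
    return x

print(exp_newton(1, 50))
```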



4.2.6 Higher Order Newton-like Methods 

The classical Newton's method is based on a linear approximation of f(x)
near $x_0$. If we use a higher-order approximation, we can get a higher-order
method. Consider for example a second-order approximation. Eq. (4.1)
becomes:

$$f(\zeta) = f(x_0) + (\zeta - x_0) f'(x_0) + \frac{(\zeta - x_0)^2}{2} f''(x_0) + \frac{(\zeta - x_0)^3}{6} f'''(\xi).$$

Since f(ζ) = 0, we have

$$\zeta = x_0 - \frac{f(x_0)}{f'(x_0)} - \frac{(\zeta - x_0)^2}{2} \frac{f''(x_0)}{f'(x_0)} + O((\zeta - x_0)^3). \qquad (4.12)$$

A difficulty here is that the right-hand side of (4.12) involves the unknown ζ.
Let $\zeta = x_0 - f(x_0)/f'(x_0) + \nu$, where ν is a second-order term. Substituting
this in the right-hand side of (4.12) and neglecting terms of order $(x_0 - \zeta)^3$
yields the iteration:

$$x_{j+1} = x_j - \frac{f(x_j)}{f'(x_j)} - \frac{f(x_j)^2 f''(x_j)}{2 f'(x_j)^3}.$$

For the computation of the reciprocal (§4.2.2) with f(x) = y − 1/x, this
yields

$$x_{j+1} = x_j + x_j (1 - x_j y) + x_j (1 - x_j y)^2. \qquad (4.13)$$

For the computation of exp y using functional inversion (§4.2.5), one gets:

$$x_{j+1} = x_j + x_j (y - \ln x_j) + \frac{1}{2} x_j (y - \ln x_j)^2. \qquad (4.14)$$

These iterations can be obtained in a more systematic way that generalises
to give iterations of arbitrarily high order. For the computation of the
reciprocal, let $\varepsilon_j = 1 - x_j y$, so $x_j y = 1 - \varepsilon_j$ and (assuming $|\varepsilon_j| < 1$),

$$1/y = x_j / (1 - \varepsilon_j) = x_j (1 + \varepsilon_j + \varepsilon_j^2 + \cdots).$$

Truncating after the term $\varepsilon_j^{k-1}$ gives a k-th order iteration

$$x_{j+1} = x_j (1 + \varepsilon_j + \varepsilon_j^2 + \cdots + \varepsilon_j^{k-1}) \qquad (4.15)$$

for the reciprocal. The case k = 2 corresponds to Newton's method, and the
case k = 3 is just the iteration (4.13) that we derived above.

Similarly, for the exponential we take $\varepsilon_j = y - \ln x_j = \ln(x/x_j)$, where x = exp y, so

$$x/x_j = \exp \varepsilon_j = \sum_{m=0}^{\infty} \frac{\varepsilon_j^m}{m!}.$$

Truncating after k terms gives a k-th order iteration

$$x_{j+1} = x_j \left( \sum_{m=0}^{k-1} \frac{\varepsilon_j^m}{m!} \right) \qquad (4.16)$$

for the exponential function. The case k = 2 corresponds to the Newton
iteration, the case k = 3 is the iteration (4.14) that we derived above, and
the cases k > 3 give higher-order Newton-like iterations. For a generalisation
to other functions, see Exercise 4.3.
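
For the reciprocal, the order of iteration (4.15) can be verified exactly: one step maps the residual ε to ε^k. The following sketch uses exact rational arithmetic; `reciprocal_step` is a hypothetical name chosen for the example.

```python
from fractions import Fraction

def reciprocal_step(x, y, k):
    """One step of the k-th order iteration (4.15) for 1/y."""
    eps = 1 - x * y
    return x * sum(eps ** m for m in range(k))

x, y = Fraction(5, 4), Fraction(3, 4)
eps = 1 - x * y
for k in (2, 3, 5):
    new_eps = 1 - reciprocal_step(x, y, k) * y
    print(k, new_eps == eps ** k)
```

This follows from $(1 - \varepsilon)(1 + \varepsilon + \cdots + \varepsilon^{k-1}) = 1 - \varepsilon^k$.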

4.3 Argument Reduction 

Argument reduction is a classical method to improve the efficiency of the 
evaluation of mathematical functions. The key idea is to reduce the initial 
problem to a domain where the function is easier to evaluate. More precisely, 
given / to evaluate at x, one proceeds in three steps: 

• argument reduction: x is transformed into a reduced argument x';

• evaluation: f is evaluated at x';

• reconstruction: f(x) is computed from f(x') using a functional identity.




In some cases the argument reduction or the reconstruction is trivial, for 
example x' = x/2 in radix 2, or f(x) = ±f(x') (some examples will illustrate 
this below). It might also be that the evaluation step uses a different function 
g instead of f; for example sin(x + π/2) = cos(x).

Unfortunately, argument reduction formulae do not exist for every func- 
tion; for example, no argument reduction is known for the error function. 
Argument reduction is only possible when a functional identity relates f(x) 
and f(x') (or g(x')). The elementary functions have addition formulae such 
as 

$$\begin{aligned}
\exp(x + y) &= \exp(x)\exp(y), \\
\log(xy) &= \log(x) + \log(y), \\
\sin(x + y) &= \sin(x)\cos(y) + \cos(x)\sin(y), \\
\tan(x + y) &= \frac{\tan(x) + \tan(y)}{1 - \tan(x)\tan(y)}. \qquad (4.17)
\end{aligned}$$

We use these formulae to reduce the argument so that power series converge
more rapidly. Usually we take x = y to get doubling formulae such as

$$\exp(2x) = \exp(x)^2, \qquad (4.18)$$

though occasionally tripling formulae such as

$$\sin(3x) = 3\sin(x) - 4\sin^3(x)$$

might be useful. This tripling formula only involves one function (sin),
whereas the doubling formula sin(2x) = 2 sin x cos x involves two functions
(sin and cos), although this problem can be overcome: see §4.3.4 and §4.9.1.

One usually distinguishes two kinds of argument reduction: 

• additive argument reduction, where x' = x − kc, for some real constant
c and some integer k. This occurs in particular when f(x) is periodic,
for example for the sine and cosine functions with c = 2π;

• multiplicative argument reduction, where x' = x/c for some real constant
c. This occurs with c = 2 in the computation of exp x when using
the doubling formula (4.18): see §4.3.1.

Note that, for a given function, both kinds of argument reduction might be
available. For example, for sin x, one might either use the tripling formula
$\sin(3x) = 3\sin x - 4\sin^3 x$, or the additive reduction sin(x + 2kπ) = sin x
that arises from the periodicity of sin.

Sometimes "reduction" is not quite the right word, since a functional identity
is used to increase rather than to decrease the argument. For example,
the Gamma function Γ(x) satisfies the identity

$$x\Gamma(x) = \Gamma(x + 1),$$

which can be used repeatedly to increase the argument until we reach the
region where Stirling's asymptotic expansion is sufficiently accurate, see §4.5.



4.3.1 Repeated Use of a Doubling Formula 

If we apply the doubling formula (4.18) for the exponential function k times,
we get

$$\exp(x) = \exp(x/2^k)^{2^k}.$$

Thus, if |x| = O(1), we can reduce the problem of evaluating exp(x) to that
of evaluating $\exp(x/2^k)$, where the argument is now $O(2^{-k})$. This is better
since the power series converges more quickly for $x/2^k$. The cost is the k
squarings that we need to reconstruct the final result from $\exp(x/2^k)$.

There is a trade-off here, and k should be chosen to minimise the total
time. If the obvious method for power series evaluation is used, then the
optimal k is of order $\sqrt{n}$ and the overall time is $O(n^{1/2} M(n))$. We shall see
in §4.4.3 that there are faster ways to evaluate power series, so this is not
the best possible result.

We assumed here that |x| = O(1). A more careful analysis shows that
the optimal k depends on the order of magnitude of x (see Exercise 4.5).
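
The effect of the reduction can be seen with a fixed number of series terms. The sketch below uses Python's `decimal` module; the function name `exp_reduced`, the 40-digit working precision and the choice of 12 terms are assumptions made for the illustration, and no attempt is made to choose k optimally.

```python
from decimal import Decimal, localcontext

def exp_reduced(x, k, terms, digits=40):
    """exp(x) as exp(x/2**k)**(2**k), summing `terms` terms of the power series."""
    with localcontext() as ctx:
        ctx.prec = digits
        r = Decimal(x) / (2 ** k)        # reduced argument
        term = Decimal(1)
        total = Decimal(1)
        for m in range(1, terms):
            term = term * r / m          # r**m / m!
            total += term
        for _ in range(k):               # reconstruction: k squarings
            total = total * total
        return total

print(exp_reduced(1, 0, 12))   # no reduction: only about 9 correct digits
print(exp_reduced(1, 8, 12))   # reduced argument 1/256: far more accurate
```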

4.3.2 Loss of Precision 

For some power series, especially those with alternating signs, a loss of pre- 
cision might occur due to a cancellation between successive terms. A typical 
example is $\exp x$ for $x < 0$. Assume for example that we want 10 significant
digits of $\exp(-10)$. The first ten terms $x^k/k!$ for $x = -10$ are approximately:

1., -10., 50., -166.6666667, 416.6666667, -833.3333333, 1388.888889,
-1984.126984, 2480.158730, -2755.731922.

Note that these terms alternate in sign and initially increase in magnitude.
They only start to decrease in magnitude for $k > |x|$. If we add the first 51
terms with a working precision of 10 decimal digits, we get an approximation
to $\exp(-10)$ that is only accurate to about 3 digits!
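
The experiment just described can be reproduced with a short Python sketch that emulates 10-digit decimal arithmetic. The helper name `exp_series` is hypothetical, and the second computation sums the series at $+10$ and takes one reciprocal, so that all terms are positive.

```python
from decimal import Decimal, localcontext

def exp_series(x, terms=51, digits=10):
    """Sum the first `terms` terms of the exp series in `digits`-digit arithmetic."""
    with localcontext() as ctx:
        ctx.prec = digits
        x = Decimal(x)
        term = total = Decimal(1)
        for k in range(1, terms):
            term = term * x / k        # t_k = t_{k-1} * x / k
            total += term
        return total

naive = exp_series(-10)        # alternating terms: heavy cancellation
stable = 1 / exp_series(10)    # positive terms only, then one reciprocal
```

Comparing `naive` and `stable` with a reference value of $\exp(-10)$ shows the loss of digits in the first and not in the second.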

A much better approach is to use the identity

$$\exp(x) = 1/\exp(-x)$$

to avoid cancellation in the power series summation. In other cases a different
power series without sign changes might exist for a closely related function:
for example, compare the series (4.22) and (4.23) for computation of the error
function erf(x). See Exercises 4.19 ff.

4.3.3 Guard Digits 

Guard digits are digits in excess of the number of digits that are required in 
the final answer. Generally, it is necessary to use some guard digits during 
a computation in order to obtain an accurate result (one that is correctly 
rounded or differs from the correctly rounded result by a small number of 
units in the last place). Of course, it is expensive to use too many guard 
digits. Thus, care has to be taken to use the right number of guard digits, 
that is the right working precision. 

Consider once again the example of $\exp x$, with reduced argument $x/2^k$
and $x = O(1)$. Since $x/2^k$ is $O(2^{-k})$, when we sum the power series
$1 + x/2^k + \cdots$ from left to right (forward summation), we "lose" about k bits
of precision. More precisely, if $x/2^k$ is accurate to n bits, then $1 + x/2^k$ is
accurate to n + k bits, but if we use the same working precision n, we obtain
only n correct bits. After squaring k times in the reconstruction step, about
k bits will be lost (each squaring loses about one bit), so the final accuracy
will be only $n - k$ bits. If we summed the power series in reverse order instead
(backward summation), and used a working precision of n + k when adding
1 and $x/2^k + \cdots$ and during the squarings, we would obtain an accuracy of
n + k bits before the k squarings, and an accuracy of n bits in the final result.

Another way to avoid loss of precision is to evaluate $\mathrm{expm1}(x/2^k)$, where
the function expm1 is defined by

$$\mathrm{expm1}(x) = \exp(x) - 1$$

and has a doubling formula that avoids loss of significance when $|x|$ is small.
See the exercises at the end of the chapter.
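
As an illustration, one doubling formula with this property follows from $(1+e)^2 - 1 = e(2+e)$, that is $\mathrm{expm1}(2t) = \mathrm{expm1}(t)\,(2 + \mathrm{expm1}(t))$. The text does not state the formula explicitly, so the sketch below, in ordinary double precision, should be read as one possible realisation; the name `expm1_by_doubling` and the defaults `k = 10`, `terms = 8` are assumptions.

```python
def expm1_by_doubling(x, k=10, terms=8):
    """Approximate exp(x) - 1 without ever forming a value close to 1."""
    r = x / 2.0 ** k
    # Taylor series of expm1 at the reduced argument: r + r^2/2! + r^3/3! + ...
    term = total = r
    for j in range(2, terms + 1):
        term *= r / j
        total += term
    # Doubling step expm1(2t) = expm1(t) * (2 + expm1(t)), applied k times.
    for _ in range(k):
        total = total * (2.0 + total)
    return total
```

Because the constant 1 is never added to a tiny quantity, the small leading digits of the result are preserved.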



4.3.4 Doubling versus Tripling 

Suppose we want to compute the function $\sinh(x) = (e^x - e^{-x})/2$. The
obvious doubling formula for sinh,

$$\sinh(2x) = 2\sinh(x)\cosh(x),$$

involves the auxiliary function $\cosh(x) = (e^x + e^{-x})/2$. Since
$\cosh^2(x) - \sinh^2(x) = 1$, we could use the doubling formula

$$\sinh(2x) = 2\sinh(x)\sqrt{1 + \sinh^2(x)},$$

but this involves the overhead of computing a square root. This suggests
using the tripling formula

$$\sinh(3x) = \sinh(x)\,(3 + 4\sinh^2(x)). \qquad (4.19)$$

However, it is more efficient to do argument reduction via the doubling
formula (4.18) for exp, because it takes one multiplication and one squaring
to apply the tripling formula, but only two squarings to apply the doubling
formula twice (and $3 < 2^2$). A drawback is loss of precision, caused by
cancellation in the computation of $\exp(x) - \exp(-x)$, when $|x|$ is small. In this
case it is better to use (see Exercise 4.10)

$$\sinh(x) = (\mathrm{expm1}(x) - \mathrm{expm1}(-x))/2. \qquad (4.20)$$

See also the Notes and References for further comments on doubling versus
tripling.

4.4 Power Series

Once argument reduction has been applied, where possible (§4.3), one is
usually faced with the evaluation of a power series. The elementary and
special functions have power series expansions such as:

$$\exp x = \sum_{j \ge 0} \frac{x^j}{j!}, \qquad \ln(1+x) = \sum_{j \ge 0} \frac{(-1)^j x^{j+1}}{j+1},$$

$$\arctan x = \sum_{j \ge 0} \frac{(-1)^j x^{2j+1}}{2j+1}, \qquad \sinh x = \sum_{j \ge 0} \frac{x^{2j+1}}{(2j+1)!}, \qquad \text{etc.}$$

This section discusses several techniques to recommend or to avoid. We use
the following notations: x is the evaluation point, n is the desired precision,
and d is the number of terms retained in the power series, or $d - 1$ is the
degree of the corresponding polynomial $\sum_{0 \le j < d} a_j x^j$.

If f(x) is analytic in a neighbourhood of some point c, an obvious method
to consider for the evaluation of f(x) is summation of the Taylor series

$$f(x) = \sum_{j=0}^{d-1} (x-c)^j \frac{f^{(j)}(c)}{j!} + R_d(x,c).$$

As a simple but instructive example we consider the evaluation of $\exp(x)$
for $|x| \le 1$, using

$$\exp(x) = \sum_{j=0}^{d-1} \frac{x^j}{j!} + R_d(x), \qquad (4.21)$$

where $|R_d(x)| \le |x|^d \exp(|x|)/d! \le e/d!$.

Using Stirling's approximation for $d!$, we see that $d \ge K(n) \sim n/\lg n$
is sufficient to ensure that $|R_d(x)| = O(2^{-n})$. Thus, the time required to
evaluate (4.21) with Horner's rule (footnote 4) is $O(nM(n)/\log n)$.



4 By Horner's rule (with argument x) we mean evaluating the polynomial
$s_0 = \sum_{0 \le j \le d} a_j x^j$ of degree d (not $d - 1$ in this footnote) by the
recurrence $s_d = a_d$, $s_j = a_j + s_{j+1} x$ for $j = d-1, d-2, \ldots, 0$. Thus
$s_k = \sum_{k \le j \le d} a_j x^{j-k}$. An evaluation by Horner's rule takes d additions
and d multiplications, and is more efficient than explicitly evaluating the
individual terms $a_j x^j$.

In practice it is convenient to sum the series in the forward direction
($j = 0, 1, \ldots, d-1$). The terms $t_j = x^j/j!$ and partial sums

$$S_j = \sum_{i=0}^{j} t_i$$

may be generated by the recurrence $t_j = x\,t_{j-1}/j$, $S_j = S_{j-1} + t_j$, and the
summation terminated when $|t_d| < 2^{-n}/e$. Thus, it is not necessary to
estimate d in advance, as it would be if the series were summed by Horner's rule
in the backward direction ($j = d-1, d-2, \ldots, 0$) (see however Ex. 4.4).

We now consider the effect of rounding errors, under the assumption that
floating-point operations are correctly rounded, i.e., satisfy

$$\circ(x \;\mathrm{op}\; y) = (x \;\mathrm{op}\; y)(1 + \delta),$$

where $|\delta| \le \varepsilon$ and "op" = "+", "−", "×" or "/". Here $\varepsilon = 2^{-n}$ is the
"machine precision" or "working precision". Let $\hat{t}_j$ be the computed value of
$t_j$, etc. Thus

$$|\hat{t}_j - t_j| / |t_j| \le 2j\varepsilon + O(\varepsilon^2)$$

and using $\sum_{j=0}^{d} t_j = S_d \le e$:

$$|\hat{S}_d - S_d| \le d\,e\,\varepsilon + \sum_{j=1}^{d} 2j\varepsilon\,|t_j| + O(\varepsilon^2) \le (d+2)\,e\,\varepsilon + O(\varepsilon^2) = O(n\varepsilon).$$

Thus, to get $|\hat{S}_d - S_d| = O(2^{-n})$ it is sufficient that $\varepsilon = O(2^{-n}/n)$, i.e., we
need to work with about $\lg n$ guard digits. This is not a significant overhead if
(as we assume) the number of digits may vary dynamically. We can sum with
j increasing (the forward direction) or decreasing (the backward direction).
A slightly better error bound is obtainable for summation in the backward
direction, but this method has the disadvantage that the number of terms d
has to be decided in advance (see however Ex. 4.4).

In practice it is inefficient to keep the working precision $\varepsilon$ fixed. We can
profitably reduce it when computing $t_j$ from $t_{j-1}$ if $|t_{j-1}| \ll 1$, without
significantly increasing the error bound. We can also vary the working precision
when accumulating the sum, especially if it is computed in the backward
direction (so the smallest terms are summed first).

It is instructive to consider the effect of relaxing our restriction that
$|x| \le 1$. First suppose that x is large and positive. Since $|t_j| > |t_{j-1}|$
when $j < |x|$, it is clear that the number of terms required in the sum (4.21)
is at least of order $|x|$. Thus, the method is slow for large $|x|$ (see §4.3 for
faster methods in this case).

If $|x|$ is large and x is negative, the situation is even worse. From Stirling's
approximation we have

$$\max_{j \ge 0} |t_j| \simeq \frac{\exp|x|}{\sqrt{2\pi|x|}},$$

but the result is $\exp(-|x|)$, so about $2|x|/\log 2$ guard digits are required to
compensate for what Lehmer called "catastrophic cancellation" [77]. Since
$\exp(x) = 1/\exp(-x)$, this problem may easily be avoided, but the
corresponding problem is not always so easily avoided for other analytic functions.

Here is a less trivial example. To compute the error function

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-u^2}\,du,$$

we may use either the power series

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \sum_{j=0}^{\infty} \frac{(-1)^j x^{2j+1}}{j!\,(2j+1)} \qquad (4.22)$$

or the (mathematically, but not numerically) equivalent

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \exp(-x^2) \sum_{j=0}^{\infty} \frac{2^j x^{2j+1}}{1 \cdot 3 \cdot 5 \cdots (2j+1)}. \qquad (4.23)$$

For small $|x|$, the series (4.22) is slightly faster than the series (4.23)
because there is no need to compute an exponential. However, the series
(4.23) is preferable to (4.22) for moderate $|x|$ because it involves no
cancellation. For large $|x|$ neither series is satisfactory, because $\Omega(x^2)$ terms
are required, and in this case it is preferable to use the asymptotic expansion
for $\mathrm{erfc}(x) = 1 - \mathrm{erf}(x)$: see §4.5. In the borderline region use of the
continued fraction (4.41) could be considered: see Exercise 4.31.
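
The difference between the two series can be observed in ordinary double precision. The following sketch is only illustrative: the function names are hypothetical, the stopping thresholds are ad hoc, and the argument is assumed to be non-negative.

```python
import math

def erf_alternating(x):
    """Series (4.22): alternating signs, so cancellation grows with x."""
    p = total = x                      # p = (-1)^j x^(2j+1) / j!
    j = 0
    while abs(p) > 1e-18 or j < x * x:
        j += 1
        p *= -x * x / j
        total += p / (2 * j + 1)
    return 2.0 / math.sqrt(math.pi) * total

def erf_positive(x):
    """Series (4.23): all terms positive, at the cost of one exponential."""
    t = total = x                      # t = 2^j x^(2j+1) / (1*3*...*(2j+1))
    j = 0
    while t > 1e-18 * total:
        j += 1
        t *= 2.0 * x * x / (2 * j + 1)
        total += t
    return 2.0 / math.sqrt(math.pi) * math.exp(-x * x) * total
```

For small arguments the two functions agree to nearly full precision; for an argument such as 5.0 the alternating series loses many digits while the positive series does not.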

In the following subsections we consider different methods to evaluate
power series. We generally ignore the effect of rounding errors, but the
results obtained above are typical.

Assumption about the Coefficients

We assume in this section that we have a power series $\sum_{j \ge 0} a_j x^j$ where
$a_{j+\delta}/a_j$ is a rational function $R(j)$ of j, and hence it is easy to evaluate
$a_0, a_1, a_2, \ldots$ sequentially. Here $\delta$ is a fixed positive constant, usually 1 or 2.
For example, in the case of $\exp x$, we have $\delta = 1$ and

$$\frac{a_{j+1}}{a_j} = \frac{j!}{(j+1)!} = \frac{1}{j+1}.$$

Our assumptions cover the common case of hypergeometric functions. For
the more general case of holonomic functions, see §4.9.2.

In common cases where our assumption is invalid, other good methods
are available to evaluate the function. For example, $\tan x$ does not satisfy our
assumption (the coefficients in its Taylor series are called tangent numbers
and are related to Bernoulli numbers, see §4.7.2), but to evaluate $\tan x$ we
can use Newton's method on the inverse function (arctan, which does satisfy
our assumptions, see §4.2.5), or we can use $\tan x = \sin x / \cos x$.

The Radius of Convergence 

If the elementary function is an entire function (e.g., exp, sin) then the power
series converges in the whole complex plane. In this case the degree of the
denominator of $R(j) = a_{j+1}/a_j$ is greater than that of the numerator.

In other cases (such as ln, arctan) the function is not entire. The power
series only converges in a disk because the function has a singularity on the
boundary of this disk. In fact $\ln(x)$ has a singularity at the origin, which is
why we consider the power series for $\ln(1 + x)$. This power series has radius
of convergence 1.

Similarly, the power series for arctan(x) has radius of convergence 1 be- 
cause arctan(x) has singularities on the unit circle (at ±i) even though it is 
uniformly bounded for all real x. 

4.4.1 Direct Power Series Evaluation 

Suppose that we want to evaluate a power series $\sum_{j \ge 0} a_j x^j$ at a given
argument x. Using periodicity (in the cases of sin, cos) and/or argument reduction
techniques (§4.3), we can often ensure that $|x|$ is sufficiently small. Thus, let
us assume that $|x| \le 1/2$ and that the radius of convergence of the series is
at least 1.

As above, assume that $a_{j+\delta}/a_j$ is a rational function of j, and hence easy
to evaluate. For simplicity we consider only the case $\delta = 1$. To sum the
series with error $O(2^{-n})$ it is sufficient to take $n + O(1)$ terms, so the time
required is $O(nM(n))$. If the function is entire, then the series converges
faster and the time is reduced to $O(nM(n)/\log n)$. However, we can do
much better by carrying the argument reduction further, as demonstrated in
the next section.

4.4.2 Power Series With Argument Reduction 

Consider the evaluation of $\exp(x)$. By applying argument reduction $k + O(1)$
times, we can ensure that the argument x satisfies $|x| < 2^{-k}$. Then, to
obtain n-bit accuracy we only need to sum $O(n/k)$ terms of the power series.
Assuming that a step of argument reduction is $O(M(n))$, which is true for
the elementary functions, the total cost is $O((k + n/k)M(n))$. Indeed, the
argument reduction and/or reconstruction requires $O(k)$ steps of $O(M(n))$,
and the evaluation of the power series of order $n/k$ costs $(n/k)M(n)$; so
choosing $k \sim n^{1/2}$ gives cost

$$O(n^{1/2} M(n)).$$

Examples 

For example, our comments apply to the evaluation of $\exp(x)$ using

$$\exp(x) = \exp(x/2)^2,$$

to $\mathrm{log1p}(x) = \ln(1 + x)$ using

$$\mathrm{log1p}(x) = 2\,\mathrm{log1p}\!\left(\frac{x}{1 + \sqrt{1 + x}}\right),$$

and to $\arctan(x)$ using

$$\arctan x = 2 \arctan\!\left(\frac{x}{1 + \sqrt{1 + x^2}}\right).$$

Note that in the last two cases each step of the argument reduction requires
a square root, but this can be done with cost $O(M(n))$ by Newton's method
(§3.5). Thus in all three cases the overall cost is $O(n^{1/2} M(n))$, although
the implicit constant might be smaller for exp than for log1p or arctan. See
the exercises at the end of the chapter.
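
The log1p reduction can be sketched in double precision as follows. The function name `log1p_reduced` and the defaults `k = 10`, `terms = 8` are assumptions made for illustration; an arbitrary-precision version would choose k of order $n^{1/2}$ as discussed above.

```python
import math

def log1p_reduced(x, k=10, terms=8):
    """Approximate ln(1 + x) for x > -1 via k argument-reduction steps."""
    y = x
    for _ in range(k):
        # log1p(y) = 2 * log1p(y / (1 + sqrt(1 + y))): one square root per step
        y = y / (1.0 + math.sqrt(1.0 + y))
    # Series ln(1 + y) = y - y^2/2 + y^3/3 - ... at the reduced argument
    total, power = 0.0, 1.0
    for j in range(1, terms + 1):
        power *= y
        total += power / j if j % 2 else -power / j
    return total * 2.0 ** k            # reconstruction: multiply by 2^k
```

After k reductions the argument is roughly $\ln(1+x)/2^k$, so only a handful of series terms are needed.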



Using Symmetries 

A not-so-well-known idea is to evaluate $\ln(1 + x)$ using the power series

$$\ln\!\left(\frac{1+y}{1-y}\right) = 2 \sum_{j \ge 0} \frac{y^{2j+1}}{2j+1}$$

with y defined by $(1+y)/(1-y) = 1+x$, i.e., $y = x/(2+x)$. This saves half the
terms and also reduces the argument, since $y < x/2$ if $x > 0$. Unfortunately
this nice idea can be applied only once. For a related example, see Ex. 4.11.



4.4.3 Rectangular Series Splitting 

Once we determine how many terms in the power series are required for the 
desired accuracy, the problem reduces to evaluating a truncated power series, 
i.e., a polynomial. 

Let $P(x) = \sum_{0 \le j < d} a_j x^j$ be the polynomial that we want to evaluate,
$\deg(P) < d$. In the general case x is a floating-point number of n bits, and
we aim at an accuracy of n bits for $P(x)$. However the coefficients $a_j$, or
their ratios $R(j) = a_{j+1}/a_j$, are usually small integers or rational numbers
of $O(\log n)$ bits. A scalar multiplication involves one coefficient $a_j$ and the
variable x (or more generally an n-bit floating-point number), whereas a
nonscalar multiplication involves two powers of x (or more generally two n-bit
floating-point numbers). Scalar multiplications are cheaper because the $a_j$
are small rationals of size $O(\log n)$, whereas x and its powers generally have
$O(n)$ bits. It is possible to evaluate $P(x)$ with $O(\sqrt{n})$ nonscalar
multiplications (plus $O(n)$ scalar multiplications and $O(n)$ additions, using $O(\sqrt{n})$
storage). The same idea applies, more generally, to evaluation of
hypergeometric functions.

Classical Splitting



Suppose $d = jk$, define $y = x^k$, and write

$$P(x) = \sum_{\ell=0}^{j-1} y^\ell P_\ell(x) \quad \text{where} \quad P_\ell(x) = \sum_{m=0}^{k-1} a_{k\ell+m}\,x^m.$$

One first computes the powers $x^2, x^3, \ldots, x^{k-1}, x^k = y$; then the polynomials
$P_\ell(x)$ are evaluated simply by multiplying $a_{k\ell+m}$ and the precomputed $x^m$ (it
is important not to use Horner's rule here, since this would involve expensive
nonscalar multiplications). Finally, $P(x)$ is computed from the $P_\ell(x)$ using
Horner's rule with argument y. To see the idea geometrically, write $P(x)$ as

$$\begin{array}{l}
y^0\,[a_0 + a_1 x + a_2 x^2 + \cdots + a_{k-1} x^{k-1}] \; + \\
y^1\,[a_k + a_{k+1} x + a_{k+2} x^2 + \cdots + a_{2k-1} x^{k-1}] \; + \\
y^2\,[a_{2k} + a_{2k+1} x + a_{2k+2} x^2 + \cdots + a_{3k-1} x^{k-1}] \; + \\
\quad \vdots \\
y^{j-1}\,[a_{(j-1)k} + a_{(j-1)k+1} x + a_{(j-1)k+2} x^2 + \cdots + a_{jk-1} x^{k-1}]
\end{array}$$

where $y = x^k$. The terms in square brackets are the polynomials $P_0(x)$,
$P_1(x), \ldots, P_{j-1}(x)$.

As an example, consider $d = 12$, with $j = 3$ and $k = 4$. This gives
$P_0(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$, $P_1(x) = a_4 + a_5 x + a_6 x^2 + a_7 x^3$,
$P_2(x) = a_8 + a_9 x + a_{10} x^2 + a_{11} x^3$, then $P(x) = P_0(x) + y P_1(x) + y^2 P_2(x)$, where $y = x^4$.
Here we need to compute $x^2, x^3, x^4$, which requires three nonscalar products,
and we need two nonscalar products to evaluate $P(x)$, thus a total of five
nonscalar products, instead of $d - 2 = 10$ with a naive application of Horner's
rule to $P(x)$ (footnote 5).

5 $P(x)$ has degree $d - 1$, so Horner's rule performs $d - 1$ products, but the first one
$x \times a_{d-1}$ is a scalar product, hence there are $d - 2$ nonscalar products.

Modular Splitting

An alternate splitting is the following, which may be obtained by transposing
the matrix of coefficients above, swapping j and k, and interchanging the
powers of x and y. It might also be viewed as a generalized odd-even scheme
(§1.3.5). Suppose as before that $d = jk$, and write, with $y = x^j$:

$$P(x) = \sum_{\ell=0}^{j-1} x^\ell P_\ell(y) \quad \text{where} \quad P_\ell(y) = \sum_{m=0}^{k-1} a_{jm+\ell}\,y^m.$$

First compute $y = x^j, y^2, y^3, \ldots, y^{k-1}$. Now the polynomials $P_\ell(y)$ can be
evaluated using only scalar multiplications of the form $a_{jm+\ell} \times y^m$.
To see the idea geometrically, write $P(x)$ as

$$\begin{array}{l}
x^0\,[a_0 + a_j y + a_{2j} y^2 + \cdots] \; + \\
x^1\,[a_1 + a_{j+1} y + a_{2j+1} y^2 + \cdots] \; + \\
x^2\,[a_2 + a_{j+2} y + a_{2j+2} y^2 + \cdots] \; + \\
\quad \vdots \\
x^{j-1}\,[a_{j-1} + a_{2j-1} y + a_{3j-1} y^2 + \cdots]
\end{array}$$

where $y = x^j$. We traverse the first row of the array, then the second row,
then the third, ..., finally the j-th row, accumulating sums $S_0, S_1, \ldots, S_{j-1}$
(one for each row). At the end of this process $S_\ell = P_\ell(y)$ and we only have
to evaluate

$$P(x) = \sum_{\ell=0}^{j-1} x^\ell S_\ell.$$

The complexity of each scheme is almost the same (see Ex. 4.12). With
$d = 12$ ($j = 3$ and $k = 4$) we have $P_0(y) = a_0 + a_3 y + a_6 y^2 + a_9 y^3$,
$P_1(y) = a_1 + a_4 y + a_7 y^2 + a_{10} y^3$, $P_2(y) = a_2 + a_5 y + a_8 y^2 + a_{11} y^3$. We first compute
$y = x^3$, $y^2$ and $y^3$, then we evaluate $P_0(y)$ in three scalar multiplications $a_3 y$,
$a_6 y^2$, and $a_9 y^3$ and three additions, similarly for $P_1$ and $P_2$, and finally we
evaluate $P(x)$ using

$$P(x) = P_0(y) + x P_1(y) + x^2 P_2(y),$$

(here we might use Horner's rule). In this example, we have a total of six
nonscalar multiplications: four to compute y and its powers, and two to
evaluate $P(x)$.
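
The modular splitting scheme can be sketched with exact rational arithmetic, counting nonscalar multiplications along the way. The function name and the simple repeated-multiplication strategy for $y = x^j$ are assumptions of this sketch; the coefficients are taken to be small rationals so that multiplying by them counts as scalar.

```python
from fractions import Fraction

def modular_splitting(coeffs, x, j):
    """Evaluate sum(coeffs[i] * x**i) by modular splitting with y = x**j.
    Returns (value, number of nonscalar multiplications)."""
    d = len(coeffs)
    assert d % j == 0, "this sketch assumes d = j * k exactly"
    k = d // j
    nonscalar = 0
    y = x
    for _ in range(j - 1):                 # y = x^j by repeated multiplication
        y = y * x
        nonscalar += 1
    ypow = [Fraction(1), y]                # y^0, y^1, ..., y^(k-1)
    while len(ypow) < k:
        ypow.append(ypow[-1] * y)
        nonscalar += 1
    # Row sums S_l = P_l(y): only scalar multiplications a_{jm+l} * y^m.
    S = [sum(coeffs[j * m + l] * ypow[m] for m in range(k)) for l in range(j)]
    result = S[j - 1]
    for l in range(j - 2, -1, -1):         # Horner's rule in x over the row sums
        result = result * x + S[l]
        nonscalar += 1
    return result, nonscalar
```

With twelve coefficients and `j = 3` the counter reports six nonscalar multiplications, in agreement with the worked example.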

Complexity of Rectangular Series Splitting 

To evaluate a polynomial $P(x)$ of degree $d - 1 = jk - 1$, rectangular series
splitting takes $O(j + k)$ nonscalar multiplications, each costing $O(M(n))$,
and $O(jk)$ scalar multiplications. The scalar multiplications involve
multiplication and/or division of a multiple-precision number by small integers.
Assume that these multiplications and/or divisions take time $c(d)\,n$ each (see
Ex. 4.13 for a justification of this assumption). The function $c(d)$ accounts
for the fact that the involved scalars (the coefficients $a_j$ or the ratios $a_{j+1}/a_j$)
have a size depending on the degree d of $P(x)$. In practice we can usually
regard $c(d)$ as constant.

Choosing $j \sim k \sim d^{1/2}$ we get overall time

$$O(d^{1/2} M(n) + d\,n \cdot c(d)). \qquad (4.24)$$

If d is of the same order as the precision n of x, this is not an improvement
on the bound $O(n^{1/2} M(n))$ that we obtained already by argument reduction
and power series evaluation (§4.4.2). However, we can do argument reduction
before applying rectangular series splitting. Assuming that $c(n) = O(1)$ (see
Exercise 4.14 for a detailed analysis), the total complexity is:

$$T(n) = O\!\left(\frac{n}{d} M(n) + d^{1/2} M(n) + d\,n\right).$$

Which term dominates? There are two cases:

1. $M(n) \gg n^{4/3}$. Here the minimum is obtained when the first two terms
are equal, i.e., for $d \sim n^{2/3}$, which yields $T(n) = O(n^{1/3} M(n))$. This
case applies if we use classical or Karatsuba multiplication, since $\lg 3 >
4/3$, and similarly for Toom-Cook 3-, 4-, 5-, or 6-way multiplication
(but not 7-way, since $\log_7 13 < 4/3$). In this case $T(n) \gg n^{5/3}$.

2. $M(n) \ll n^{4/3}$. Here the minimum is obtained when the first and the
last terms are equal. The optimal value of d is then $\sqrt{M(n)}$, and we
get an improved bound $O(n\sqrt{M(n)}) \gg n^{3/2}$. We can not approach the
$O(n^{1+\varepsilon})$ that is achievable with AGM-based methods (if applicable).



4.5 Asymptotic Expansions 

Often it is necessary to use different methods to evaluate a special function
in different parts of its domain. For example, the exponential integral (footnote 6)

$$E_1(x) = \int_x^{\infty} \frac{\exp(-u)}{u}\,du \qquad (4.25)$$

is defined for all $x > 0$. However, the power series

$$E_1(x) + \gamma + \ln x = \sum_{j=1}^{\infty} \frac{x^j (-1)^{j-1}}{j!\,j} \qquad (4.26)$$

is unsatisfactory as a means of evaluating $E_1(x)$ for large positive x, for the
reasons discussed in §4.4 in connection with the power series (4.22) for erf(x),
or the power series for $\exp(x)$ (x negative). For sufficiently large positive x
it is preferable to use

$$e^x E_1(x) = \sum_{j=1}^{k} \frac{(j-1)!\,(-1)^{j-1}}{x^j} + R_k(x), \qquad (4.27)$$

where

$$R_k(x) = k!\,(-1)^k \exp(x) \int_x^{\infty} \frac{\exp(-u)}{u^{k+1}}\,du. \qquad (4.28)$$

Note that

$$|R_k(x)| < \frac{k!}{x^{k+1}},$$

so

$$\lim_{x \to +\infty} R_k(x) = 0,$$

but $\lim_{k \to \infty} R_k(x)$ does not exist. In other words, the series

$$\sum_{j=1}^{\infty} \frac{(j-1)!\,(-1)^{j-1}}{x^j}$$

is divergent. In such cases we call this an asymptotic series and write

$$e^x E_1(x) \sim \sum_{j>0} \frac{(j-1)!\,(-1)^{j-1}}{x^j}. \qquad (4.29)$$

6 The functions $E_1(x)$ and $\mathrm{Ei}(x) = \mathrm{PV}\!\int_{-\infty}^{x} \frac{\exp(t)}{t}\,dt$ are both called "exponential
integrals". Closely related is the "logarithmic integral" $\mathrm{li}(x) = \mathrm{Ei}(\ln x) = \mathrm{PV}\!\int_0^x \frac{dt}{\ln t}$. Here the
integrals $\mathrm{PV}\!\int \cdots$ should be interpreted as Cauchy principal values if there is a singularity
in the range of integration. The power series (4.26) is valid for $x \in \mathbb{C}$ if $|\arg x| < \pi$ (see
Exercise 4.16).

Although they do not generally converge, asymptotic series are very useful. 
Often (though not always!) the error is bounded by the last term taken in the 
series (or by the first term omitted). Also, when the terms in the asymptotic 
series alternate in sign, it can often be shown that the true value lies between 
two consecutive approximations obtained by summing the series with (say) 
k and k + 1 terms. For example, this is true for the series (4.29) above,
provided x is real and positive.

When x is large and positive, the relative error attainable by using (4.27)
with $k \sim x$ is $O(x^{1/2} \exp(-x))$, because

$$|R_k(k)| \le k!/k^{k+1} = O(k^{-1/2} \exp(-k)) \qquad (4.30)$$

and the leading term on the right side of (4.27) is $1/x$. Thus, the
asymptotic series may be used to evaluate $E_1(x)$ to precision n whenever
$x > n \ln 2 + O(\ln n)$. More precise estimates can be obtained by using a version
of Stirling's approximation with error bounds, for example

$$\left(\frac{k}{e}\right)^k \sqrt{2\pi k} < k! < \left(\frac{k}{e}\right)^k \sqrt{2\pi k}\,\exp\!\left(\frac{1}{12k}\right).$$

If x is too small for the asymptotic approximation to be sufficiently accurate,
we can avoid the problem of cancellation in the power series (4.26) by the
technique of Exercise 4.19. However, the asymptotic approximation is faster
and hence is preferable whenever it is sufficiently accurate.

Examples where asymptotic expansions are useful include the evaluation
of erfc(x), $\Gamma(x)$, Bessel functions, etc. We discuss some of these below.

Asymptotic expansions often arise when the convergence of series is
accelerated by the Euler-Maclaurin sum formula (footnote 7). For example, Euler's constant
$\gamma$ is defined by

$$\gamma = \lim_{N \to \infty} (H_N - \ln N), \qquad (4.32)$$

where $H_N = \sum_{1 \le j \le N} 1/j$ is a harmonic number. However, the limit in the
definition (4.32) converges slowly, so to evaluate $\gamma$ accurately we need to
accelerate the convergence. This can be done using the Euler-Maclaurin
formula. The idea is to split the sum $H_N$ into two parts:

$$H_N = H_{p-1} + \sum_{j=p}^{N} 1/j.$$

We approximate the second sum using the Euler-Maclaurin formula (4.31)
with $a = p$, $b = N$, $f(x) = 1/x$, then let $N \to +\infty$. The result is

$$\gamma \sim H_p - \ln p - \frac{1}{2p} + \sum_{k \ge 1} \frac{B_{2k}}{2k}\,p^{-2k}. \qquad (4.33)$$

Here the term $-1/(2p)$ comes from the $f(a)/2$ term of (4.31) combined with
$H_{p-1} = H_p - 1/p$. If p and the number of terms in the asymptotic expansion are
chosen judiciously, this gives a good algorithm for computing $\gamma$ (though not the best
algorithm: see the Notes and References for a faster algorithm that uses
properties of Bessel functions).

7 The Euler-Maclaurin sum formula is a way of expressing the difference between a sum
and an integral as an asymptotic expansion. For example, assuming that $a \in \mathbb{Z}$, $b \in \mathbb{Z}$,
$a \le b$, and f(x) satisfies certain conditions, one form of the formula is

$$\sum_{a \le k \le b} f(k) - \int_a^b f(x)\,dx \sim \frac{f(a) + f(b)}{2} + \sum_{k \ge 1} \frac{B_{2k}}{(2k)!} \left( f^{(2k-1)}(b) - f^{(2k-1)}(a) \right). \qquad (4.31)$$

Often we can let $b \to +\infty$ and omit the terms involving b on the right-hand side. For
more information see the Notes and References.
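
A double-precision sketch of this computation follows. It uses the standard form $\gamma \approx H_p - \ln p - 1/(2p) + \sum_k B_{2k}/(2k\,p^{2k})$; the function name, the defaults `p = 10`, `m = 6` and the hard-coded table of Bernoulli numbers are assumptions for illustration only (a real arbitrary-precision code would compute the Bernoulli numbers as needed).

```python
import math
from fractions import Fraction

# Bernoulli numbers B_2, B_4, ..., B_12 (standard values).
BERNOULLI_EVEN = [Fraction(1, 6), Fraction(-1, 30), Fraction(1, 42),
                  Fraction(-1, 30), Fraction(5, 66), Fraction(-691, 2730)]

def euler_gamma(p=10, m=6):
    """Approximate Euler's constant from H_p, ln p and m Euler-Maclaurin terms."""
    assert 1 <= m <= len(BERNOULLI_EVEN)
    h_p = float(sum(Fraction(1, j) for j in range(1, p + 1)))   # harmonic number H_p
    tail = sum(float(BERNOULLI_EVEN[k - 1]) / (2 * k * p ** (2 * k))
               for k in range(1, m + 1))
    return h_p - math.log(p) - 1.0 / (2 * p) + tail
```

With these defaults the first omitted term is of order $10^{-15}$, so the result is accurate to roughly double precision.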

Here is another example. The Riemann zeta function $\zeta(s)$ is defined for
$s \in \mathbb{C}$, $\Re(s) > 1$, by

$$\zeta(s) = \sum_{j=1}^{\infty} j^{-s}, \qquad (4.34)$$

and by analytic continuation for other $s \ne 1$. $\zeta(s)$ may be evaluated to any
desired precision if m and p are chosen large enough in the Euler-Maclaurin
formula

$$\zeta(s) = \sum_{j=1}^{p-1} j^{-s} + \frac{p^{-s}}{2} + \frac{p^{1-s}}{s-1} + \sum_{k=1}^{m} T_{k,p}(s) + E_{m,p}(s), \qquad (4.35)$$

where

$$T_{k,p}(s) = \frac{B_{2k}}{(2k)!}\,p^{1-s-2k} \prod_{j=0}^{2k-2} (s+j), \qquad (4.36)$$

$$|E_{m,p}(s)| < |T_{m+1,p}(s)\,(s + 2m + 1)/(\sigma + 2m + 1)|, \qquad (4.37)$$

$m \ge 0$, $p \ge 1$, $\sigma = \Re(s) > -(2m + 1)$, and the $B_{2k}$ are Bernoulli numbers.
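
For real $s > 1$ the formula can be tried out in double precision. This is a sketch under stated assumptions: the function name `zeta_em`, the defaults `p = 10`, `m = 6` and the fixed table of Bernoulli numbers are illustrative, and the error term $E_{m,p}(s)$ is simply dropped.

```python
import math
from fractions import Fraction

# Bernoulli numbers B_2, B_4, ..., B_12 (standard values).
B_EVEN = [Fraction(1, 6), Fraction(-1, 30), Fraction(1, 42),
          Fraction(-1, 30), Fraction(5, 66), Fraction(-691, 2730)]

def zeta_em(s, p=10, m=6):
    """Approximate zeta(s) for real s > 1 by the Euler-Maclaurin formula."""
    assert 1 <= m <= len(B_EVEN)
    total = sum(j ** -s for j in range(1, p))          # sum_{j=1}^{p-1} j^(-s)
    total += p ** -s / 2 + p ** (1 - s) / (s - 1)
    for k in range(1, m + 1):
        prod = 1.0
        for j in range(2 * k - 1):                     # j = 0, ..., 2k-2
            prod *= s + j
        total += (float(B_EVEN[k - 1]) / math.factorial(2 * k)
                  * p ** (1 - s - 2 * k) * prod)       # T_{k,p}(s)
    return total
```

The values at $s = 2$ and $s = 4$ can be compared with $\pi^2/6$ and $\pi^4/90$ as a sanity check.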

In arbitrary-precision computations we must be able to compute as many
terms of an asymptotic expansion as are required to give the desired accuracy.
It is easy to see that, if m in (4.35) is bounded as the precision n goes to
$\infty$, then p has to increase as an exponential function of n. To evaluate $\zeta(s)$
from (4.35) to precision n in time polynomial in n, both m and p must tend to
infinity with n. Thus, the Bernoulli numbers $B_2, \ldots, B_{2m}$ can not be stored in
a table of fixed size (footnote 8), but must be computed when needed (see §4.7). For this
reason we can not use asymptotic expansions when the general form of the
coefficients is unknown or the coefficients are too difficult to evaluate. Often
there is a related expansion with known and relatively simple coefficients.
For example, the asymptotic expansion (4.39) for $\ln\Gamma(x)$ has coefficients
related to the Bernoulli numbers, like the expansion (4.35) for $\zeta(s)$, and thus
is simpler to implement than Stirling's asymptotic expansion for $\Gamma(x)$ (see
Exercise 4.41).

8 In addition, we would have to store them as exact rationals, taking $\sim m^2 \lg m$ bits
of storage, since a floating-point representation would not be convenient unless the target
precision n were known in advance. See §4.7.2 and Exercise 4.37.

Consider the computation of the error function erf x. As seen in §4.4,
the series (4.22) and (4.23) are not satisfactory for large $|x|$, since they
require $\Omega(x^2)$ terms. For example, to evaluate erf(1000) with an accuracy of 6
digits, Eq. (4.22) requires at least 2,718,279 terms! Instead, we may use an
asymptotic expansion. The complementary error function $\mathrm{erfc}\,x = 1 - \mathrm{erf}\,x$
satisfies

$$\mathrm{erfc}\,x \sim \frac{e^{-x^2}}{x\sqrt{\pi}} \sum_{j=0}^{k} (-1)^j \frac{(2j)!}{j!}\,(2x)^{-2j}, \qquad (4.38)$$

with the error bounded in absolute value by the next term and of the same
sign. In the case $x = 1000$, the term for $j = 1$ of the sum equals $-0.5 \times 10^{-6}$;
thus $e^{-x^2}/(x\sqrt{\pi})$ is an approximation to erfc x with an accuracy of 6 digits.
For a function like the error function where both a power series (at $x = 0$)
and an asymptotic expansion (at $x = \infty$) are available, we might prefer to
use the former or the latter, depending on the value of the argument and
on the desired precision. We study here in some detail the case of the error
function, since it is typical.

The sum in (4.38) is divergent, since its j-th term $\sim \sqrt{2}\,(j/(e x^2))^j$. We
need to show that the smallest term is $O(2^{-n})$ in order to be able to deduce
an n-bit approximation to erfc x. The terms decrease while $j < x^2 + 1/2$,
so the minimum is obtained for $j \approx x^2$, and is of order $e^{-x^2}$, thus we need
$x > \sqrt{n \ln 2}$. For example, for $n = 10^6$ bits this yields $x > 833$. However,
since erfc x is small for large x, say $\mathrm{erfc}\,x \approx 2^{-\lambda}$, we need only $m = n - \lambda$
correct bits of erfc x to get n correct bits of $\mathrm{erf}\,x = 1 - \mathrm{erfc}\,x$.

Consider x fixed and j varying in the terms in the sums (4.22) and (4.38).
Note that $x^{2j}/j!$ is an increasing function of j for $j < x^2$, but $(2j)!/(j!\,(4x^2)^j)$
is a decreasing function of j for $j < x^2$. In this region the terms in Eq. (4.38)
are decreasing. Thus, comparing the series (4.22) and (4.38), we see that the
latter should always be used if it can give sufficient accuracy. Similarly, (4.38)
should if possible be used in preference to (4.23), as the magnitudes of
corresponding terms in (4.22) and in (4.23) are similar.

Algorithm 51 Erf

Input: positive floating-point number x, integer n
Output: an n-bit approximation to erf(x)
$m \leftarrow \lceil n - (x^2 + \ln x + \frac{1}{2} \ln \pi)/(\ln 2) \rceil$
if $(m + \frac{1}{2}) \ln(2) < x^2$ then
    $t \leftarrow \mathrm{erfc}(x)$ with the asymptotic expansion (4.38) and precision m
    Return $1 - t$ (in precision n)
else if $x < 1$ then
    compute erf(x) with the power series (4.22) in precision n
else
    compute erf(x) with the power series (4.23) in precision n



Algorithm Erf computes erf(x) for real positive x (for other real x, use
the fact that erf(x) is an odd function, so $\mathrm{erf}(-x) = -\mathrm{erf}(x)$ and erf(0) = 0).
In Algorithm Erf, the number of terms needed if Eq. (4.22) or Eq. (4.23) is
used is approximately the unique positive root $j_0$ (rounded up to the next
integer) of

$$j(\ln j - 2 \ln x - 1) = n \ln 2,$$

so $j_0 > e x^2$. On the other hand, if Eq. (4.38) is used, then the number
of terms $k < x^2 + 1/2$ (since otherwise the terms start increasing). The
condition $(m + \frac{1}{2}) \ln(2) < x^2$ in the algorithm ensures that the asymptotic
expansion can give m-bit accuracy.

An example: for $x = 800$ and a precision of one million bits, Equation
(4.23) requires about $j_0 = 2\,339\,601$ terms. Eq. (4.38) tells us that
$\mathrm{erfc}\,x \approx 2^{-923\,335}$; thus we need only $m = 76\,665$ bits of precision for erfc x; in this case
Eq. (4.38) requires only about $k = 10\,375$ terms. Note that using Eq. (4.22)
would be slower than using Eq. (4.23), because we would have to compute
about the same number of terms, but with higher precision, to compensate
for cancellation. We recommend using Eq. (4.22) only if $|x|$ is small enough
that any cancellation is insignificant (for example, if $|x| < 1$).

Another example, closer to the boundary: for $x = 589$, still with $n = 10^6$,
we have $m = 499\,489$, which gives $j_0 = 1\,497\,924$, and $k = 325\,092$. For
somewhat smaller x (or larger n) it might be desirable to use the continued
fraction (4.41), see Exercise 4.31.
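
The choice of method and working precision in Algorithm Erf can be checked numerically with a few lines of Python. The function `erf_strategy` is a hypothetical helper that only reproduces the branch structure and the value of m; it does not evaluate erf itself.

```python
import math

def erf_strategy(x, n):
    """Return (method, working precision in bits) as selected by Algorithm Erf."""
    m = math.ceil(n - (x * x + math.log(x) + 0.5 * math.log(math.pi)) / math.log(2))
    if (m + 0.5) * math.log(2) < x * x:
        return "asymptotic expansion (4.38) for erfc", m
    if x < 1:
        return "power series (4.22)", n
    return "power series (4.23)", n
```

For $x = 800$ and $x = 589$ with $n = 10^6$ this reproduces the values of m quoted in the two examples.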

Occasionally an asymptotic expansion can be used to obtain arbitrarily
high precision. For example, consider the computation of $\ln\Gamma(x)$. For large
positive x, we can use Stirling's asymptotic expansion

$$\ln\Gamma(x) = \left(x - \frac{1}{2}\right) \ln x - x + \frac{\ln(2\pi)}{2} + \sum_{k=1}^{m-1} \frac{B_{2k}}{2k(2k-1)\,x^{2k-1}} + R_m(x), \qquad (4.39)$$

where $R_m(x)$ is less in absolute value than the first term neglected, that is

$$\frac{B_{2m}}{2m(2m-1)\,x^{2m-1}},$$

and has the same sign (footnote 9). The ratio of successive terms $t_k$ and $t_{k+1}$ of the sum
is

$$\frac{t_{k+1}}{t_k} \approx -\left(\frac{k}{\pi x}\right)^2,$$

so the terms start to increase in absolute value for (approximately) $k > \pi x$.
This gives a bound on the accuracy attainable, in fact

$$\ln|R_m(x)| > -2\pi x \ln(x) + O(x).$$

However, because $\Gamma(x)$ satisfies the functional equation $\Gamma(x + 1) = x\,\Gamma(x)$, we
can take $x' = x + \delta$ for some sufficiently large $\delta \in \mathbb{N}$, evaluate $\ln\Gamma(x')$ using
the asymptotic expansion, and then compute $\ln\Gamma(x)$ from the functional
equation. See Exercise 4.21.
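
The shift-then-expand idea can be sketched in double precision. The function name `ln_gamma`, the shift of 20 and the five hard-coded Bernoulli numbers are assumptions of this sketch; in arbitrary precision the shift and the number of terms would both grow with n.

```python
import math
from fractions import Fraction

# Bernoulli numbers B_2, B_4, ..., B_10 (standard values).
B_EVEN = [Fraction(1, 6), Fraction(-1, 30), Fraction(1, 42),
          Fraction(-1, 30), Fraction(5, 66)]

def ln_gamma(x, shift=20, m=6):
    """Approximate ln Gamma(x) for real x > 0 using (4.39) at x + shift."""
    assert 1 <= m - 1 <= len(B_EVEN)
    xs = x + shift
    value = (xs - 0.5) * math.log(xs) - xs + 0.5 * math.log(2 * math.pi)
    for k in range(1, m):                  # k = 1, ..., m-1 as in (4.39)
        value += float(B_EVEN[k - 1]) / (2 * k * (2 * k - 1) * xs ** (2 * k - 1))
    # Undo the shift: Gamma(x + shift) = x (x+1) ... (x+shift-1) Gamma(x).
    for i in range(shift):
        value -= math.log(x + i)
    return value
```

Subtracting the logarithms loses a few bits to cancellation, which is one reason a real implementation would carry guard digits here.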

9 The asymptotic expansion is also valid for $x \in \mathbb{C}$, $|\arg x| < \pi$, $x \ne 0$, but the bound
on the error term $R_m(x)$ in this case is more complicated. See for example [?, 6.1.42].

4.6 Continued Fractions

In §4.5 we considered the exponential integral $E_1(x)$. This can be computed
using the continued fraction

$$e^x E_1(x) = \cfrac{1}{x + \cfrac{1}{1 + \cfrac{1}{x + \cfrac{2}{1 + \cfrac{2}{x + \cfrac{3}{1 + \cdots}}}}}}.$$

Writing continued fractions in this way takes a lot of space, so instead we
use the shorthand notation

$$e^x E_1(x) = \frac{1}{x+}\,\frac{1}{1+}\,\frac{1}{x+}\,\frac{2}{1+}\,\frac{2}{x+}\,\frac{3}{1+} \cdots. \qquad (4.40)$$

Another example is 



,■2 



, e~ x \ 1 1/2 2/2 3/2 4/2 5/2 
erfca;== — = — — — — . (4.41) 

/7T / X+ X+ X+ X+ X+ X+ 



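Formally, a continued fraction

$$f = b_0 + \frac{a_1}{b_1+}\,\frac{a_2}{b_2+}\,\frac{a_3}{b_3+}\cdots \in \widehat{\mathbb{C}}$$

is defined by two sequences $(a_j)_{j\in\mathbb{N}^*}$ and $(b_j)_{j\in\mathbb{N}}$,
where $a_j, b_j \in \mathbb{C}$. Here $\widehat{\mathbb{C}} = \mathbb{C} \cup \{\infty\}$
is the set of extended complex numbers (see footnote 10). The expression $f$ is
defined to be $\lim_{k\to\infty} f_k$, if the limit exists, where

$$f_k = b_0 + \frac{a_1}{b_1+}\,\frac{a_2}{b_2+}\,\frac{a_3}{b_3+}\cdots\frac{a_k}{b_k} \tag{4.42}$$

is the finite continued fraction obtained by truncating the infinite continued
fraction after $k$ quotients (called the $k$-th approximant).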

Sometimes continued fractions are preferable, for computational purposes,
to power series or asymptotic expansions. For example, Euler's continued
fraction (4.40) converges for all real $x > 0$, and is better for computation
of $E_1(x)$ than the power series (4.26) in the region where the power series
suffers from catastrophic cancellation but the asymptotic expansion (4.27) is
not sufficiently accurate. Convergence of (4.40) is slow if $x$ is small, so
(4.40) is preferred for precision $n$ evaluation of $E_1(x)$ only when $x$ is
in a certain interval, say $x \in (c_1 n, c_2 n)$, $c_1 \approx 0.1$,
$c_2 = \ln 2 \approx 0.6931$ (see Exercise 4.24).

Continued fractions may be evaluated by either forward or backward
recurrence relations. Consider the finite continued fraction

$$y = \frac{a_1}{b_1+}\,\frac{a_2}{b_2+}\,\frac{a_3}{b_3+}\cdots\frac{a_k}{b_k}. \tag{4.43}$$



10 Arithmetic operations on $\mathbb{C}$ are extended to $\widehat{\mathbb{C}}$ in the obvious way, for example
$1/0 = 1 + \infty = 1 \times \infty = \infty$, $1/\infty = 0$. Note that $0/0$, $0 \times \infty$ and $\infty \pm \infty$ are undefined.
The backward recurrence is $R_k = 1$, $R_{k-1} = b_k$,

$$R_j = b_{j+1} R_{j+1} + a_{j+2} R_{j+2} \quad (j = k-2, \ldots, 0), \tag{4.44}$$

and $y = a_1 R_1/R_0$, with invariant

$$\frac{R_j}{R_{j-1}} = \frac{1}{b_j+}\,\frac{a_{j+1}}{b_{j+1}+}\cdots\frac{a_k}{b_k}.$$

The forward recurrence is $P_0 = 0$, $P_1 = a_1$, $Q_0 = 1$, $Q_1 = b_1$,

$$\left.\begin{aligned} P_j &= b_j P_{j-1} + a_j P_{j-2} \\ Q_j &= b_j Q_{j-1} + a_j Q_{j-2} \end{aligned}\right\} \quad (j = 2, \ldots, k), \tag{4.45}$$

and $y = P_k/Q_k$ (see Exercise 4.26).

The advantage of evaluating an infinite continued fraction such as (4.40)
via the forward recurrence is that the cutoff $k$ need not be chosen in advance;
we can stop when $|D_k|$ is sufficiently small, where

$$D_k = \frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}}. \tag{4.46}$$

The main disadvantage of the forward recurrence is that twice as many arith- 
metic operations are required as for the backward recurrence with the same 
value of k. Another disadvantage is that the forward recurrence may be less 
numerically stable than the backward recurrence. 

If we are working with variable-precision floating-point arithmetic which
is much more expensive than single-precision floating-point, then a useful
strategy is to use the forward recurrence with single-precision arithmetic
(scaled to avoid overflow/underflow) to estimate $k$, then use the backward
recurrence with variable-precision arithmetic. One trick is needed: to
evaluate $D_k$ using scaled single-precision we use the recurrence

$$\left.\begin{aligned} D_1 &= a_1/b_1, \\ D_j &= -a_j Q_{j-2} D_{j-1}/Q_j \quad (j = 2, 3, \ldots) \end{aligned}\right\} \tag{4.47}$$

which avoids the cancellation inherent in (4.46).
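The two recurrences and the cutoff estimate can be sketched as follows for
Euler's continued fraction (4.40). This is an illustrative sketch only: exact
rationals (`Fraction`) stand in for variable-precision arithmetic, Python
floats stand in for "single precision", and the helper names `e1_cf`,
`backward`, `forward` and `estimate_cutoff` as well as the rescaling threshold
are assumptions of this example.

```python
from fractions import Fraction

def e1_cf(x, k):
    """Coefficients a_1..a_k and b_1..b_k of (4.40); index 0 is unused."""
    a = [None] + [Fraction(max(1, j // 2)) for j in range(1, k + 1)]
    b = [None] + [Fraction(x) if j % 2 else Fraction(1)
                  for j in range(1, k + 1)]
    return a, b

def backward(a, b):
    """Evaluate the finite fraction (4.43) by the recurrence (4.44), k >= 1."""
    k = len(a) - 1
    R = [Fraction(0)] * (k + 1)
    R[k] = Fraction(1)
    R[k - 1] = b[k]
    for j in range(k - 2, -1, -1):
        R[j] = b[j + 1] * R[j + 1] + a[j + 2] * R[j + 2]
    return a[1] * R[1] / R[0]

def forward(a, b):
    """Evaluate the same fraction by the forward recurrence (4.45)."""
    k = len(a) - 1
    P_prev, P = Fraction(0), a[1]
    Q_prev, Q = Fraction(1), b[1]
    for j in range(2, k + 1):
        P_prev, P = P, b[j] * P + a[j] * P_prev
        Q_prev, Q = Q, b[j] * Q + a[j] * Q_prev
    return P / Q

def estimate_cutoff(x, tol, max_terms=100000):
    """Smallest k >= 2 with |D_k| < tol for (4.40), using floats and (4.47)."""
    q_prev, q = 1.0, x                 # Q_0, Q_1
    d = 1.0 / x                        # D_1 = a_1 / b_1
    for j in range(2, max_terms + 1):
        a_j = float(j // 2)
        b_j = x if j % 2 else 1.0
        q_next = b_j * q + a_j * q_prev
        d = -a_j * q_prev * d / q_next  # D_j = -a_j Q_{j-2} D_{j-1} / Q_j
        q_prev, q = q, q_next
        if q > 1e150:                  # common rescaling keeps ratios intact
            q_prev /= 1e150
            q /= 1e150
        if abs(d) < tol:
            return j
    raise ValueError("no cutoff found below max_terms")
```

A typical use is to call `estimate_cutoff` cheaply in floats, then evaluate
`backward(*e1_cf(x, k))` once at the working precision.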

By analogy with the case of power series with decreasing terms that
alternate in sign, there is one case in which it is possible to give a simple
a posteriori bound for the error incurred in truncating a continued fraction.
Let $f$ be a convergent continued fraction with approximants $f_k$ as in (4.42).
Then

Theorem 4.6.1 If $a_j > 0$ and $b_j > 0$ for all $j \in \mathbb{N}^*$, then the
sequence $(f_{2k})_{k\in\mathbb{N}}$ of even order approximants is strictly
increasing, and the sequence $(f_{2k+1})_{k\in\mathbb{N}}$ of odd order
approximants is strictly decreasing. Thus

$$f_{2k} < f < f_{2k+1}$$

and

$$\left| f - \frac{f_{m-1} + f_m}{2} \right| < \left| \frac{f_m - f_{m-1}}{2} \right|$$

for all $m \in \mathbb{N}^*$.

In general, if the conditions of Theorem 4.6.1 are not satisfied, then it
is difficult to give simple, sharp error bounds. Power series and asymptotic 
series are usually much easier to analyse than continued fractions. 



4.7 Recurrence Relations 

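The evaluation of special functions by continued fractions is a special case
of their evaluation by recurrence relations. To illustrate this, we consider
the Bessel functions of the first kind, $J_\nu(x)$. Here $\nu$ and $x$ can in
general be complex, but we restrict attention to the case $\nu \in \mathbb{Z}$,
$x \in \mathbb{R}$. The functions $J_\nu(x)$ can be defined in several ways, for
example by the generating function (elegant but only useful for
$\nu \in \mathbb{Z}$):

$$\exp\left(\frac{x}{2}\left(t - \frac{1}{t}\right)\right) = \sum_{\nu=-\infty}^{+\infty} t^\nu J_\nu(x), \tag{4.48}$$

or by the power series (also valid if $\nu \notin \mathbb{Z}$):

$$J_\nu(x) = \left(\frac{x}{2}\right)^\nu \sum_{j=0}^{\infty} \frac{(-x^2/4)^j}{j!\,\Gamma(\nu+j+1)}. \tag{4.49}$$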



We also need Bessel functions of the second kind (Weber functions) $Y_\nu(x)$,
which may be defined by:

$$Y_\nu(x) = \lim_{\mu\to\nu} \frac{J_\mu(x)\cos(\pi\mu) - J_{-\mu}(x)}{\sin(\pi\mu)}. \tag{4.50}$$

Both $J_\nu(x)$ and $Y_\nu(x)$ are solutions of Bessel's differential equation

$$x^2 y'' + x y' + (x^2 - \nu^2) y = 0. \tag{4.51}$$
4.7.1 Evaluation of Bessel Functions

The Bessel functions $J_\nu(x)$ satisfy the recurrence relation

$$J_{\nu-1}(x) + J_{\nu+1}(x) = \frac{2\nu}{x} J_\nu(x). \tag{4.52}$$

Dividing both sides by $J_\nu(x)$, we see that

$$\frac{J_{\nu-1}(x)}{J_\nu(x)} = \frac{2\nu}{x} - 1 \bigg/ \frac{J_\nu(x)}{J_{\nu+1}(x)},$$

which gives a continued fraction for the ratio $J_\nu(x)/J_{\nu-1}(x)$ ($\nu \ge 1$):

$$\frac{J_\nu(x)}{J_{\nu-1}(x)} = \frac{1}{2\nu/x-}\,\frac{1}{2(\nu+1)/x-}\,\frac{1}{2(\nu+2)/x-}\cdots \tag{4.53}$$

However, (4.53) is not immediately useful for evaluating the Bessel functions
$J_0(x)$ or $J_1(x)$, as it only gives their ratio.

The recurrence (4.52) may be evaluated backwards by Miller's algorithm.
The idea is to start at some sufficiently large index $\nu'$, take
$f_{\nu'+1} = 0$, $f_{\nu'} = 1$, and evaluate the recurrence

$$f_{\nu-1} + f_{\nu+1} = \frac{2\nu}{x} f_\nu \tag{4.54}$$

backwards to obtain $f_{\nu'-1}, \ldots, f_0$. However, (4.54) is the same
recurrence as (4.52), so we expect to obtain $f_0 \approx c J_0(x)$ where $c$ is
some scale factor. We can use the identity

$$J_0(x) + 2\sum_{\nu=1}^{\infty} J_{2\nu}(x) = 1 \tag{4.55}$$

to determine $c$.
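As a concrete illustration, Miller's algorithm can be written in a few lines
of double-precision Python. This is a sketch under simplifying assumptions:
the starting index `start` is supplied by the caller rather than derived from
the target accuracy, no rescaling against overflow is done, and the function
name `miller_bessel` is hypothetical.

```python
def miller_bessel(x, n_max, start):
    """Approximate [J_0(x), ..., J_{n_max}(x)] for real x != 0.

    Runs the recurrence (4.54) backwards from f_{start+1} = 0, f_start = 1
    and normalises with the identity (4.55). Requires start > n_max, and
    start should be well above |x| for good accuracy.
    """
    f = [0.0] * (start + 2)
    f[start] = 1.0                      # f[start + 1] stays 0
    for nu in range(start, 0, -1):
        f[nu - 1] = (2 * nu / x) * f[nu] - f[nu + 1]
    # c = f_0 + 2 * (f_2 + f_4 + ...) approximates the scale factor
    c = f[0] + 2.0 * sum(f[2::2])
    return [f[nu] / c for nu in range(n_max + 1)]
```

For example, `miller_bessel(1.0, 1, 30)` returns approximations to $J_0(1)$
and $J_1(1)$; the unnormalised values $f_\nu$ grow rapidly as $\nu$ decreases,
which is why a production version would rescale them.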

To understand why Miller's algorithm works, and why evaluation of the
recurrence (4.52) in the forward direction is numerically unstable for
$\nu > x$, we observe that the recurrence (4.54) has two independent solutions:
the desired solution $J_\nu(x)$, and an undesired solution $Y_\nu(x)$, where
$Y_\nu(x)$ is a Bessel function of the second kind, see Eq. (4.50). The general
solution of the recurrence (4.54) is a linear combination of the special
solutions $J_\nu(x)$ and $Y_\nu(x)$. Due to rounding errors, the computed
solution will also be a linear combination, say $aJ_\nu(x) + bY_\nu(x)$. Since
$|Y_\nu(x)|$ increases exponentially with $\nu$ when $\nu > ex/2$, but
$|J_\nu(x)|$ is bounded, the unwanted component will increase exponentially if
we use the recurrence in the forward direction, but decrease if we use it in
the backward direction.

More precisely, we have

$$J_\nu(x) \sim \frac{1}{\sqrt{2\pi\nu}}\left(\frac{ex}{2\nu}\right)^\nu \quad\text{and}\quad Y_\nu(x) \sim -\sqrt{\frac{2}{\pi\nu}}\left(\frac{2\nu}{ex}\right)^\nu \tag{4.56}$$

as $\nu \to +\infty$ with $x$ fixed. Thus, when $\nu$ is large and greater than
$ex/2$, $J_\nu(x)$ is small and $|Y_\nu(x)|$ is large.

Miller's algorithm seems to be the most effective method in the region
where the power series (4.49) suffers from catastrophic cancellation but
asymptotic expansions are not sufficiently accurate. For more on Miller's
algorithm, see the Notes and References.

4.7.2 Evaluation of Bernoulli and Tangent numbers 

In Section 4.5, equations (4.36) and (4.39), the Bernoulli numbers $B_{2k}$ or
scaled Bernoulli numbers $C_k = B_{2k}/(2k)!$ were required. These constants
can be defined by the generating functions

$$\sum_{k=0}^{\infty} B_k \frac{x^k}{k!} = \frac{x}{e^x - 1}, \tag{4.57}$$

$$\sum_{k=0}^{\infty} C_k x^{2k} = \frac{x}{e^x - 1} + \frac{x}{2} = \frac{x/2}{\tanh(x/2)}. \tag{4.58}$$

Multiplying both sides of (4.57) or (4.58) by $e^x - 1$ and equating coefficients
gives the recurrence relations

$$B_0 = 1, \quad \sum_{j=0}^{m} \binom{m+1}{j} B_j = 0 \ \text{ for } m > 0, \tag{4.59}$$

and

$$\sum_{j=0}^{k} \frac{C_j}{(2k+1-2j)!} = \frac{1}{2\,(2k)!} \tag{4.60}$$

(the latter for $k > 0$, starting from $C_0 = 1$).

These recurrences, or slight variants with similar numerical properties, have 
often been used to evaluate Bernoulli numbers. 
In this chapter our philosophy is that the required precision is not known
in advance, so it is not possible to precompute the Bernoulli numbers and
store them in a table once and for all. Thus, we need a good algorithm for
computing them at runtime.

Unfortunately, forward evaluation of the recurrence (4.59), or the
corresponding recurrence (4.60) for the scaled Bernoulli numbers, is
numerically unstable: using precision $n$ the relative error in the computed
$B_{2k}$ or $C_k$ is of order $4^k 2^{-n}$: see Exercise 4.35.

Despite its numerical instability, use of (4.60) may give the $C_k$ to
acceptable accuracy if they are only needed to generate coefficients in an
Euler-Maclaurin expansion whose successive terms diminish by at least a factor
of four (or if they are computed using exact rational arithmetic). If the $C_k$
are required to precision $n$, then (4.60) should be used with sufficient guard
digits, or (better) a more stable recurrence should be used. If we multiply both
sides of (4.58) by $\sinh(x/2)/x$ and equate coefficients, we get the recurrence

$$\sum_{j=0}^{k} \frac{C_j}{(2k+1-2j)!\,4^{k-j}} = \frac{1}{(2k)!\,4^k}. \tag{4.61}$$

If (4.61) is used to evaluate $C_k$, using precision $n$ arithmetic, the relative
error is only $O(k^2 2^{-n})$. Thus, use of (4.61) gives a stable algorithm for
evaluating the scaled Bernoulli numbers $C_k$ (and hence, if desired, the
Bernoulli numbers).
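The recurrence (4.61) can be solved for $C_k$, since the $j = k$ term on the
left is $C_k$ itself. The sketch below does this either in double precision
or, for comparison, in exact rational arithmetic. The function name
`scaled_bernoulli` and the `exact` flag are assumptions of this example, and
doubles merely stand in for precision-$n$ arithmetic.

```python
import math
from fractions import Fraction

def scaled_bernoulli(kmax, exact=False):
    """Return [C_0, ..., C_kmax] with C_k = B_{2k}/(2k)!, from (4.61).

    C_k = 1/((2k)! 4^k) - sum_{j<k} C_j / ((2k+1-2j)! 4^(k-j)).
    With exact=True the computation uses Fractions, otherwise floats.
    """
    one = Fraction(1) if exact else 1.0
    C = []
    for k in range(kmax + 1):
        rhs = one / (math.factorial(2 * k) * 4 ** k)
        s = sum(C[j] / (math.factorial(2 * k + 1 - 2 * j) * 4 ** (k - j))
                for j in range(k))
        C.append(rhs - s)
    return C
```

Comparing `scaled_bernoulli(12)` with `scaled_bernoulli(12, exact=True)` gives
a quick empirical check that the floating-point values keep nearly full
relative accuracy for small $k$.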

An even better, and perfectly stable, way to compute Bernoulli numbers
is to exploit their relationship with the tangent numbers $T_j$, defined by

$$\tan x = \sum_{j \ge 1} T_j \frac{x^{2j-1}}{(2j-1)!}. \tag{4.62}$$

The tangent numbers are positive integers and can be expressed in terms of
Bernoulli numbers:

$$T_j = (-1)^{j-1}\, 2^{2j} \left(2^{2j} - 1\right) \frac{B_{2j}}{2j}. \tag{4.63}$$

Conversely, the Bernoulli numbers can be expressed in terms of tangent
numbers:

$$B_j = \begin{cases} 1 & \text{if } j = 0, \\ -1/2 & \text{if } j = 1, \\ (-1)^{j/2-1}\, j\, T_{j/2}/(4^j - 2^j) & \text{if } j > 0 \text{ is even}, \\ 0 & \text{otherwise.} \end{cases}$$

Equation (4.63) shows that the odd primes in the denominator of the Bernoulli
number $B_{2j}$ must be divisors of $2^{2j} - 1$. In fact, this is a consequence
of the Von Staudt-Clausen theorem, which says that the primes $p$ dividing the
denominator of $B_{2j}$ are precisely those for which $(p-1) \mid 2j$ (see the
Notes and References).

Algorithm 52 Tangent-numbers

Input: positive integer $m$
Output: Tangent numbers $T_1, \ldots, T_m$
$T_1 \leftarrow 1$
for $k$ from 2 to $m$ do
    $T_k \leftarrow (k-1) * T_{k-1}$
for $k$ from 2 to $m$ do
    for $j$ from $k$ to $m$ do
        $T_j \leftarrow (j-k) * T_{j-1} + (j-k+2) * T_j$
Return $T_1, T_2, \ldots, T_m$

We now derive a recurrence that can be used to compute tangent numbers,
using only integer arithmetic. For brevity write $t = \tan x$ and $D = d/dx$.
Then $Dt = \sec^2 x = 1 + t^2$. It follows that $D(t^n) = nt^{n-1}(1+t^2)$ for
all $n \in \mathbb{N}^*$.

It is clear that $D^n t$ is a polynomial in $t$, say $P_n(t)$. For example,
$P_0(t) = t$, $P_1(t) = 1 + t^2$, etc. Write $P_n(t) = \sum_{j\ge0} p_{n,j} t^j$.
From the recurrence $P_n(t) = DP_{n-1}(t)$, we see that $\deg(P_n) = n+1$ and

$$\sum_{j\ge0} p_{n,j}\, t^j = \sum_{j\ge0} j\,p_{n-1,j}\, t^{j-1}(1+t^2),$$

so

$$p_{n,j} = (j-1)\,p_{n-1,j-1} + (j+1)\,p_{n-1,j+1} \tag{4.64}$$

for all $n \in \mathbb{N}^*$. Using (4.64) it is straightforward to compute the
coefficients of the polynomials $P_1(t)$, $P_2(t)$, etc.

Observe that, since $\tan x$ is an odd function of $x$, the polynomials
$P_{2k}(t)$ are odd, and the polynomials $P_{2k+1}(t)$ are even. Equivalently,
$p_{n,j} = 0$ if $n + j$ is even.

We are interested in the tangent numbers $T_k = P_{2k-1}(0) = p_{2k-1,0}$. Using
the recurrence (4.64) but avoiding computation of the coefficients that are
known to vanish, we obtain Algorithm Tangent-numbers for the in-place
computation of tangent numbers. Note that this algorithm uses only arithmetic
on non-negative integers. If implemented with single-precision integers,
there may be problems with overflow as the tangent numbers grow rapidly. If
implemented using floating-point arithmetic, it is numerically stable because
there is no cancellation. An analogous algorithm Secant-numbers is the
topic of Exercise 4.40.
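Algorithm Tangent-numbers translates almost line for line into Python, whose
built-in integers are arbitrary precision, so the overflow caveat does not
arise. The sketch below also recovers $B_2, B_4, \ldots$ from the tangent
numbers by inverting (4.63). The function names are assumptions of this
example, and index 0 of the working list is simply unused.

```python
from fractions import Fraction

def tangent_numbers(m):
    """Return [T_1, ..., T_m] using Algorithm Tangent-numbers (in place)."""
    T = [0] * (m + 1)          # T[0] unused, T[k] holds T_k
    T[1] = 1
    for k in range(2, m + 1):
        T[k] = (k - 1) * T[k - 1]
    for k in range(2, m + 1):
        for j in range(k, m + 1):
            T[j] = (j - k) * T[j - 1] + (j - k + 2) * T[j]
    return T[1:]

def bernoulli_from_tangent(m):
    """Return [B_2, B_4, ..., B_{2m}] as exact rationals.

    Uses B_{2j} = (-1)^(j-1) * 2j * T_j / (4^(2j) - 2^(2j)).
    """
    T = tangent_numbers(m)
    return [Fraction((-1) ** (j - 1) * 2 * j * T[j - 1],
                     4 ** (2 * j) - 2 ** (2 * j))
            for j in range(1, m + 1)]
```

For instance, `tangent_numbers(5)` gives `[1, 2, 16, 272, 7936]`, the
coefficients of $x, x^3/3!, x^5/5!, \ldots$ in the Taylor series of $\tan x$.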

The tangent numbers grow rapidly because the generating function $\tan x$
has poles at $x = \pm\pi/2$. Thus, we expect $T_k$ to grow roughly like
$(2k-1)!\,(2/\pi)^{2k}$. More precisely,

$$\frac{T_k}{(2k-1)!} = \frac{2^{2k+1}\left(1 - 2^{-2k}\right)\zeta(2k)}{\pi^{2k}}, \tag{4.65}$$

where $\zeta(s)$ is the usual Riemann zeta-function, and

$$\left(1 - 2^{-s}\right)\zeta(s) = 1 + 3^{-s} + 5^{-s} + \cdots$$

is sometimes called the odd zeta-function.

The Bernoulli numbers also grow rapidly, but not quite as fast as the
tangent numbers, because the singularities of the generating function (4.57)
are further from the origin (at $\pm 2i\pi$ instead of $\pm\pi/2$). It is
well-known that the Riemann zeta-function for even non-negative integer
arguments can be expressed in terms of Bernoulli numbers; the relation is

$$(-1)^{k-1} \frac{B_{2k}}{(2k)!} = \frac{2\zeta(2k)}{(2\pi)^{2k}}. \tag{4.66}$$

Since $\zeta(2k) = 1 + O(4^{-k})$ as $k \to +\infty$, we see that

$$|B_{2k}| \sim \frac{2\,(2k)!}{(2\pi)^{2k}}. \tag{4.67}$$

It is easy to see that (4.65) and (4.66) are equivalent, in view of the
relation (4.63).

For another way of computing Bernoulli numbers, using very little space, 
see gZEDl 
4.8 Arithmetic-Geometric Mean

The fastest known methods for very large precision $n$ are based on the
arithmetic-geometric mean (AGM) iteration of Gauss. The AGM is another
nonlinear recurrence, important enough to treat separately. Its complexity is
$O(M(n) \ln n)$; the implicit constant here can be quite large, so other methods
are better for small $n$.

Given $(a_0, b_0)$, the AGM iteration is defined by

$$(a_{j+1}, b_{j+1}) = \left(\frac{a_j + b_j}{2}, \sqrt{a_j b_j}\right).$$

For simplicity we only consider real, positive starting values $(a_0, b_0)$ here
(for complex starting values, see §4.8.5 and the Notes and References). The AGM
iteration converges quadratically to a limit which we denote by
$\mathrm{AGM}(a_0, b_0)$.

The AGM is useful because:

1. It converges quadratically. Eventually the number of correct digits
doubles at each iteration, so only $O(\log n)$ iterations are required.

2. Each iteration takes time $O(M(n))$ because the square root can be
computed in time $O(M(n))$ by Newton's method (see §3.5 and §4.2.3).

3. If we take suitable starting values $(a_0, b_0)$, the result
$\mathrm{AGM}(a_0, b_0)$ can be used to compute logarithms (directly) and other
elementary functions (less directly), as well as constants such as $\pi$ and
$\ln 2$.
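The iteration itself is very short. The following double-precision sketch is
meant only to make the definition concrete; the function name `agm`, the
relative stopping threshold and the iteration cap are assumptions of this
example, and an arbitrary-precision version would instead run a fixed number
of about $\lg n$ iterations at the working precision.

```python
import math

def agm(a, b, max_iter=64):
    """Arithmetic-geometric mean of positive reals a, b (double precision)."""
    for _ in range(max_iter):
        if abs(a - b) <= 4e-16 * max(a, b):
            break
        # one AGM step: arithmetic mean and geometric mean of the old pair
        a, b = (a + b) / 2, math.sqrt(a * b)
    return (a + b) / 2
```

Because of the quadratic convergence, only a handful of iterations are
executed even when the two starting values differ by many orders of magnitude.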

4.8.1 Elliptic Integrals 

The theory of the AGM iteration is intimately linked to the theory of elliptic
integrals. The complete elliptic integral of the first kind is defined by

$$K(k) = \int_0^{\pi/2} \frac{d\theta}{\sqrt{1 - k^2\sin^2\theta}} = \int_0^1 \frac{dt}{\sqrt{(1-t^2)(1-k^2t^2)}}, \tag{4.68}$$

and the complete elliptic integral of the second kind is

$$E(k) = \int_0^{\pi/2} \sqrt{1 - k^2\sin^2\theta}\,d\theta = \int_0^1 \sqrt{\frac{1-k^2t^2}{1-t^2}}\,dt,$$

where $k \in [0, 1]$ is called the modulus and $k' = \sqrt{1-k^2}$ is the
complementary modulus. It is traditional (though confusing as the prime does
not denote differentiation) to write $K'(k)$ for $K(k')$ and $E'(k)$ for $E(k')$.
The Connection With Elliptic Integrals. Gauss discovered that

$$\frac{1}{\mathrm{AGM}(1,k)} = \frac{2}{\pi} K'(k). \tag{4.69}$$

This identity can be used to compute the elliptic integral $K$ rapidly via
the AGM iteration. We can also use it to compute logarithms. From the
definition (4.68), we see that $K(k)$ has a series expansion that converges for
$|k| < 1$ (in fact $K(k) = \frac{\pi}{2}\,{}_2F_1(\frac12, \frac12; 1; k^2)$ is a
hypergeometric function). For small $k$ we have

$$K(k) = \frac{\pi}{2}\left(1 + \frac{k^2}{4} + O(k^4)\right). \tag{4.70}$$

It can also be shown that

$$K'(k) = \frac{2}{\pi}\ln\left(\frac{4}{k}\right)K(k) - \frac{k^2}{4} + O(k^4). \tag{4.71}$$

4.8.2 First AGM Algorithm for the Logarithm 

From the formulae (4.69), (4.70) and (4.71), we easily get

$$\frac{\pi/2}{\mathrm{AGM}(1,k)} = \ln\left(\frac{4}{k}\right)\left(1 + O(k^2)\right). \tag{4.72}$$

Thus, if $x = 4/k$ is large, we have

$$\ln(x) = \frac{\pi/2}{\mathrm{AGM}(1, 4/x)}\left(1 + O\left(\frac{1}{x^2}\right)\right).$$

If $x \ge 2^{n/2}$, we can compute $\ln(x)$ to precision $n$ using the AGM
iteration. It takes about $2\lg(n)$ iterations to converge if
$x \in [2^{n/2}, 2^n]$.

Note that we need the constant $\pi$, which could be computed by using
our formula twice with slightly different arguments $x_1$ and $x_2$, then taking
differences to approximate $(d\ln(x)/dx)/\pi$ at $x_1$ (see Exercise 4.43). More
efficient is to use the Brent-Salamin algorithm, which is based on the AGM
and the Legendre relation

$$EK' + E'K - KK' = \frac{\pi}{2}. \tag{4.73}$$
Argument Expansion. If $x$ is not large enough, we can compute

$$\ln(2^k x) = k\ln 2 + \ln x$$

by the AGM method (assuming the constant $\ln 2$ is known). Alternatively,
if $x > 1$, we can square $x$ enough times and compute

$$\ln\left(x^{2^k}\right) = 2^k \ln(x).$$

This method with $x = 2$ gives a way of computing $\ln 2$, assuming we already
know $\pi$.
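The first AGM algorithm, together with argument expansion by a power of two,
can be sketched in double precision as follows. This is an illustration under
stated assumptions rather than a faithful arbitrary-precision implementation:
the constants $\pi$ and $\ln 2$ are taken from the `math` module, the target
exponent 32 is chosen so that the $O(1/x^2)$ term is below double-precision
rounding error, and the names `agm` and `ln_agm` are hypothetical.

```python
import math

def agm(a, b, max_iter=64):
    """Arithmetic-geometric mean of positive reals a, b (double precision)."""
    for _ in range(max_iter):
        if abs(a - b) <= 4e-16 * max(a, b):
            break
        a, b = (a + b) / 2, math.sqrt(a * b)
    return (a + b) / 2

def ln_agm(x, target_exp=32):
    """ln(x) for x > 0 via ln(y) ~ (pi/2)/AGM(1, 4/y) with y = 2^k * x."""
    _, e = math.frexp(x)               # x = m * 2**e with 0.5 <= m < 1
    k = max(0, target_exp + 1 - e)     # guarantees y >= 2**target_exp
    y = math.ldexp(x, k)               # y = x * 2**k, exact scaling
    return (math.pi / 2) / agm(1.0, 4.0 / y) - k * math.log(2.0)
```

Note that subtracting $k \ln 2$ loses some relative accuracy when $x$ is close
to 1; in a variable-precision setting this is what the guard digits are for.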

The Error Term. The $O(k^2)$ error term in the formula (4.72) is a nuisance.
A rigorous bound is

$$\left|\frac{\pi/2}{\mathrm{AGM}(1,k)} - \ln\left(\frac{4}{k}\right)\right| \le 4k^2(8 - \ln k) \tag{4.74}$$

for all $k \in (0, 1]$, and the bound can be sharpened to
$0.37k^2(2.4 - \ln(k))$ if $k \in (0, 0.5]$.

The error $O(k^2|\ln k|)$ makes it difficult to accelerate convergence by using
a larger value of $k$ (i.e., a value of $x = 4/k$ smaller than $2^{n/2}$). There
is an exact formula which is much more elegant and avoids this problem. Before
giving this formula we need to define some theta functions and show how
they can be used to parameterise the AGM iteration.

4.8.3 Theta Functions 

We need theta functions $\theta_2(q)$, $\theta_3(q)$ and $\theta_4(q)$, defined
for $|q| < 1$ by:

$$\theta_2(q) = \sum_{n=-\infty}^{+\infty} q^{(n+1/2)^2} = 2q^{1/4}\sum_{n=0}^{+\infty} q^{n(n+1)},$$

$$\theta_3(q) = \sum_{n=-\infty}^{+\infty} q^{n^2} = 1 + 2\sum_{n=1}^{+\infty} q^{n^2}, \tag{4.75}$$

$$\theta_4(q) = \theta_3(-q) = 1 + 2\sum_{n=1}^{+\infty} (-1)^n q^{n^2}. \tag{4.76}$$

Note that the defining power series are sparse so it is easy to compute
$\theta_2(q)$ and $\theta_3(q)$ for small $q$. Unfortunately, the rectangular
splitting method of §4.4.3 does not help to speed up the computation.

The asymptotically fastest methods to compute theta functions use the
AGM. However, we do not follow this trail because it would lead us in circles!
We want to use theta functions to give starting values for the AGM iteration.

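Theta Function Identities. There are many classical identities involving
theta functions. Two that are of interest to us are:

$$\frac{\theta_3^2(q) + \theta_4^2(q)}{2} = \theta_3^2(q^2) \quad\text{and}\quad \theta_3(q)\,\theta_4(q) = \theta_4^2(q^2).$$

The latter may be written as

$$\sqrt{\theta_3^2(q)\,\theta_4^2(q)} = \theta_4^2(q^2)$$

to show the connection with the AGM:

$$\mathrm{AGM}(\theta_3^2(q), \theta_4^2(q)) = \mathrm{AGM}(\theta_3^2(q^2), \theta_4^2(q^2)) = \cdots = \mathrm{AGM}(\theta_3^2(q^{2^k}), \theta_4^2(q^{2^k})) = \cdots = 1$$

for any $|q| < 1$. (The limit is 1 because $q^{2^k}$ converges to 0, thus both
$\theta_3$ and $\theta_4$ converge to 1.) Apart from scaling, the AGM iteration
is parameterised by $(\theta_3^2(q^{2^k}), \theta_4^2(q^{2^k}))$ for
$k = 0, 1, 2, \ldots$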

The Scaling Factor. Since $\mathrm{AGM}(\theta_3^2(q), \theta_4^2(q)) = 1$, and
$\mathrm{AGM}(\lambda a, \lambda b) = \lambda\,\mathrm{AGM}(a, b)$, scaling gives
$\mathrm{AGM}(1, k') = 1/\theta_3^2(q)$ if $k' = \theta_4^2(q)/\theta_3^2(q)$.
Equivalently, since $\theta_2^4 + \theta_4^4 = \theta_3^4$ (Jacobi),
$k = \theta_2^2(q)/\theta_3^2(q)$. However, we know (from (4.69) with
$k \to k'$) that $1/\mathrm{AGM}(1,k') = 2K(k)/\pi$, so

$$K(k) = \frac{\pi}{2}\,\theta_3^2(q). \tag{4.77}$$

Thus, the theta functions are closely related to elliptic integrals. In the
literature $q$ is usually called the nome associated with the modulus $k$.

From $q$ to $k$ and $k$ to $q$. We saw that $k = \theta_2^2(q)/\theta_3^2(q)$,
which gives $k$ in terms of $q$. There is also a nice inverse formula which
gives $q$ in terms of $k$: $q = \exp(-\pi K'(k)/K(k))$, or equivalently

$$\ln\left(\frac{1}{q}\right) = \frac{\pi K'(k)}{K(k)}. \tag{4.78}$$
Sasaki and Kanada's Formula. Substituting (4.69) and (4.77) with
$k = \theta_2^2(q)/\theta_3^2(q)$ into (4.78) gives Sasaki and Kanada's elegant
formula:

$$\ln\left(\frac{1}{q}\right) = \frac{\pi}{\mathrm{AGM}(\theta_2^2(q), \theta_3^2(q))}. \tag{4.79}$$

This leads to the following algorithm to compute $\ln x$.

4.8.4 Second AGM Algorithm for the Logarithm 

Suppose $x$ is large. Let $q = 1/x$, compute $\theta_2(q^4)$ and $\theta_3(q^4)$
from their defining series (4.75) and (4.76), then compute
$\mathrm{AGM}(\theta_2^2(q^4), \theta_3^2(q^4))$. Sasaki and Kanada's formula
(with $q$ replaced by $q^4$ to avoid the $q^{1/4}$ term in the definition of
$\theta_2(q)$) gives

$$\ln(x) = \frac{\pi/4}{\mathrm{AGM}(\theta_2^2(q^4), \theta_3^2(q^4))}.$$

There is a trade-off between increasing $x$ (by squaring or multiplication by
a power of 2, see the paragraph on "Argument Expansion" in §4.8.2), and
taking longer to compute $\theta_2(q^4)$ and $\theta_3(q^4)$ from their series.
In practice it seems good to increase $x$ until $q = 1/x$ is small enough that
$O(q^{36})$ terms are negligible. Then we can use

$$\theta_2(q^4) = 2\left(q + q^9 + q^{25} + O(q^{49})\right),$$

$$\theta_3(q^4) = 1 + 2\left(q^4 + q^{16} + O(q^{36})\right).$$

We need $x \ge 2^{n/36}$ which is much better than the requirement
$x \ge 2^{n/2}$ for the first AGM algorithm. We save about four AGM iterations
at the cost of a few multiplications.
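The second AGM algorithm can be tried out in double precision with the
truncated theta series given above. In this sketch the requirement
$x \ge 2^{n/36}$ becomes roughly $x \ge 3$ for $n = 53$; smaller arguments
would first need argument expansion. The names `agm`, `theta23` and
`ln_sasaki_kanada` are assumptions of this example, and $\pi$ is taken from
the `math` module.

```python
import math

def agm(a, b, max_iter=64):
    """Arithmetic-geometric mean of positive reals a, b (double precision)."""
    for _ in range(max_iter):
        if abs(a - b) <= 4e-16 * max(a, b):
            break
        a, b = (a + b) / 2, math.sqrt(a * b)
    return (a + b) / 2

def theta23(q):
    """Truncated series for theta_2(q^4) and theta_3(q^4), dropping O(q^36)."""
    theta2 = 2.0 * (q + q ** 9 + q ** 25)
    theta3 = 1.0 + 2.0 * (q ** 4 + q ** 16)
    return theta2, theta3

def ln_sasaki_kanada(x):
    """ln(x) for x >= 3 via ln(x) = (pi/4) / AGM(theta_2^2(q^4), theta_3^2(q^4))."""
    theta2, theta3 = theta23(1.0 / x)
    return (math.pi / 4) / agm(theta2 ** 2, theta3 ** 2)
```

Unlike the first algorithm, the only approximation here (apart from rounding)
is the truncation of the two theta series, since formula (4.79) is exact.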

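Implementation Notes. Since

$$\mathrm{AGM}(\theta_2^2, \theta_3^2) = \frac{\mathrm{AGM}(\theta_2^2 + \theta_3^2,\, 2\theta_2\theta_3)}{2},$$

we can avoid the first square root in the AGM iteration. Also, it only takes
two nonscalar multiplications to compute $2\theta_2\theta_3$ and
$\theta_2^2 + \theta_3^2$ from $\theta_2$ and $\theta_3$: compute
$u = (\theta_2 + \theta_3)^2$, $v = \theta_2\theta_3$, then $2v$ and $u - 2v$ are
the desired values.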
Constants. Using Bernstein's algorithm (see §3.8), an $n$-bit square root
takes time $\frac{11}{6}M(n)$, thus one AGM iteration takes time
$\frac{17}{6}M(n)$. The AGM algorithms require $2\lg(n) + O(1)$ AGM iterations.
The total time to compute $\ln(x)$ by the AGM is $\sim \frac{17}{3}\lg(n)M(n)$.

Drawbacks of the AGM. The AGM has three drawbacks: 

1. The AGM iteration is not self-correcting, so we have to work with full
precision (plus any necessary guard digits) throughout. In contrast,
when using Newton's method or evaluating power series, many of the
computations can be performed with reduced precision, which saves a
$\log n$ factor.

2. The AGM with real arguments gives $\ln(x)$ directly. To obtain $\exp(x)$
we need to apply Newton's method (§4.2.5). To evaluate trigonometric
functions such as $\sin(x)$, $\cos(x)$, $\arctan(x)$ we need to work with
complex arguments, which increases the constant hidden in the "O" time
bound. Alternatively, we can use Landen transformations for incomplete
elliptic integrals, but this gives even larger constants.

3. Because it converges so fast, it is difficult to speed up the AGM. At
best we can save $O(1)$ iterations.

4.8.5 The Complex AGM 

In some cases the asymptotically fastest algorithms require the use of complex 
arithmetic to produce a real result. It would be nice to avoid this because 
complex arithmetic is significantly slower than real arithmetic. Examples 
where we seem to need complex arithmetic to get the asymptotically fastest 
algorithms are: 

1. $\arctan(x)$, $\arcsin(x)$, $\arccos(x)$ via the AGM, using, for example,

$$\arctan(x) = \Im(\ln(1 + ix));$$

2. $\tan(x)$, $\sin(x)$, $\cos(x)$ using Newton's method and the above, or

$$\cos(x) + i\sin(x) = \exp(ix),$$

where the complex exponential is computed by Newton's method from
the complex logarithm (see Eq. (4.11)).
The theory that we outlined for the AGM iteration and AGM algorithms
for $\ln(z)$ can be extended without problems to complex
$z \notin (-\infty, 0]$, provided we always choose the square root with
positive real part.

A complex multiplication takes three real multiplications (using
Karatsuba's trick), and a complex squaring takes two real multiplications. We
can do even better in the FFT domain, if we assume that one multiplication of
cost $M(n)$ is equivalent to three Fourier transforms. In this model a squaring
costs $\frac23 M(n)$. A complex multiplication
$(a+ib)(c+id) = (ac - bd) + i(ad + bc)$ requires four forward and two backward
transforms, thus costs $2M(n)$. A complex squaring
$(a+ib)^2 = (a+b)(a-b) + i(2ab)$ requires two forward and two backward
transforms, thus costs $\frac43 M(n)$. Taking this into account, we get the
following asymptotic upper bounds (0.666 should read 2/3, and so on):

Operation         real                   complex
squaring          0.666 M(n)             1.333 M(n)
multiplication    M(n)                   2 M(n)
division          2.0833 M(n)            6.5 M(n)
square root       1.8333 M(n)            6.333 M(n)
AGM iteration     2.8333 M(n)            8.333 M(n)
log via AGM       5.666 lg(n) M(n)       16.666 lg(n) M(n)
                                                        (4.80)

See the Notes and References for details of the algorithms giving these 
constants. 
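The operation counts for complex multiplication and squaring in the classical
(non-FFT) model can be made concrete with a short sketch. The function names
are assumptions of this example; additions, subtractions and the doubling in
`2 * (a * b)` are scalar operations and are not counted as multiplications.

```python
def complex_mul3(a, b, c, d):
    """(a + ib)(c + id) using three real multiplications (Karatsuba's trick).

    Returns the pair (real part, imaginary part).
    """
    t1 = a * c
    t2 = b * d
    t3 = (a + b) * (c + d)          # = ac + ad + bc + bd
    return t1 - t2, t3 - t1 - t2    # (ac - bd, ad + bc)

def complex_sqr2(a, b):
    """(a + ib)^2 using two real multiplications."""
    return (a + b) * (a - b), 2 * (a * b)
```

With exact integer inputs both functions reproduce the textbook formulas
exactly; in floating point the three-multiplication form has slightly
different rounding behaviour from the four-multiplication form.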



4.9 Binary Splitting 

Since the asymptotically fastest algorithms for arctan, sin, cos, etc. have
a large constant hidden in their time bound $O(M(n)\log n)$ (see paragraph
"Drawbacks of the AGM" in §4.8.4), it is interesting to look for other
algorithms that may be competitive for a large range of precisions, even if not
asymptotically optimal. One such algorithm (or class of algorithms) is based
on binary splitting or the closely related FEE method (see the Notes and
References). The time complexity of these algorithms is usually

$$O\left((\log n)^\alpha M(n)\right)$$

for some constant $\alpha \ge 1$ depending on how fast the relevant power series
converges, and also on the multiplication algorithm (classical, Karatsuba or
quasi-linear).

The Idea. Suppose we want to compute $\arctan(x)$ for rational $x = p/q$,
where $p$ and $q$ are small integers and $|x| < 1/2$. The Taylor series gives

$$\arctan\left(\frac{p}{q}\right) \approx \sum_{0 \le j \le n/2} \frac{(-1)^j p^{2j+1}}{(2j+1)\,q^{2j+1}}.$$

The finite sum, if computed exactly, gives a rational approximation $P/Q$ to
$\arctan(p/q)$, and

$$\log|Q| = O(n\log n).$$

(Note: the series for exp converges faster, so in this case we sum
$\sim n/\ln n$ terms and get $\log|Q| = O(n)$.)

The finite sum can be computed by the "divide and conquer" strategy:
sum the first half to get $P_1/Q_1$ say, and the second half to get $P_2/Q_2$,
then

$$\frac{P}{Q} = \frac{P_1}{Q_1} + \frac{P_2}{Q_2} = \frac{P_1Q_2 + P_2Q_1}{Q_1Q_2}.$$

The rationals $P_1/Q_1$ and $P_2/Q_2$ are computed by a recursive application of
the same method, hence the term "binary splitting". If used with quadratic
multiplication, this way of computing $P/Q$ does not help; however, fast
multiplication speeds up the balanced products $P_1Q_2$, $P_2Q_1$, and $Q_1Q_2$.
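A possible realisation of the divide-and-conquer sum for $\arctan(p/q)$ is
sketched below. To keep the denominator at the $O(n \log n)$ size mentioned
above, each half-sum is normalised relative to its first term, so that common
powers of $p^2$ and $q^2$ are carried along separately instead of being
multiplied into every term. This bookkeeping, and the names `_split` and
`arctan_rational`, are choices made for this example, not taken from the text.

```python
from fractions import Fraction

def _split(p2, q2, lo, hi):
    """Return (T, B, Pp, Qq) for the index range lo <= j < hi, where

      sum_j (-p2/q2)^(j-lo) / (2j+1) = T / (B * q2^(hi-lo-1)),
      B = prod_j (2j+1), Pp = p2^(hi-lo), Qq = q2^(hi-lo).
    """
    if hi - lo == 1:
        return 1, 2 * lo + 1, p2, q2
    mid = (lo + hi) // 2
    T1, B1, Pp1, Qq1 = _split(p2, q2, lo, mid)
    T2, B2, Pp2, Qq2 = _split(p2, q2, mid, hi)
    sign = -1 if (mid - lo) % 2 else 1
    # left half brought to the common denominator, right half shifted by
    # the factor (-p2/q2)^(mid-lo)
    T = T1 * B2 * Qq2 + sign * Pp1 * T2 * B1
    return T, B1 * B2, Pp1 * Pp2, Qq1 * Qq2

def arctan_rational(p, q, n):
    """Exact partial sum of the arctan(p/q) series over 0 <= j <= n/2.

    Intended for |p/q| < 1/2, where the result has about n correct bits.
    """
    terms = n // 2 + 1
    T, B, _, Qq = _split(p * p, q * q, 0, terms)
    # (p/q) * T / (B * q^(2(terms-1))) = p*q*T / (B * Qq)
    return Fraction(p * q * T, B * Qq)
```

All multiplications near the top of the recursion involve operands of
comparable size, which is where fast integer multiplication pays off.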

Complexity. The overall time complexity is 

(E M (J) lo s(^)) =0((lognrM(n)), 
\k>i / 

where $\alpha = 2$ in the FFT range; in general $\alpha \le 2$.

We can save a little by working to precision $n$ rather than $n \log n$ at the
top levels; but we still have $\alpha = 2$ for quasi-linear multiplication.

In practice the multiplication algorithm would not be fixed but would 
depend on the size of the integers being multiplied. The complexity would 
depend on the algorithm(s) used at the top levels. 
Repeated Application of the Idea. If $x \in (0, 0.25)$ and we want to
compute $\arctan(x)$, we can approximate $x$ by a rational $p/q$ and compute
$\arctan(p/q)$ as a first approximation to $\arctan(x)$, say
$p/q \le x < (p+1)/q$.
Now, from (fQ7|> . 

$$\tan(\arctan(x) - \arctan(p/q)) = \frac{x - p/q}{1 + px/q},$$

so

$$\arctan(x) = \arctan(p/q) + \arctan(\delta),$$

where

$$\delta = \frac{x - p/q}{1 + px/q} = \frac{qx - p}{q + px}.$$

We can apply the same idea to approximate $\arctan(\delta)$, until eventually we
get a sufficiently accurate approximation to $\arctan(x)$. Note that
$|\delta| \le |x - p/q| < 1/q$, so it is easy to ensure that the process
converges.

Complexity of Repeated Application. If we use a sequence of about
$\lg n$ rationals $p_1/q_1, p_2/q_2, \ldots$, where

$$q_i = 2^{2^i},$$

then the computation of each $\arctan(p_i/q_i)$ takes time
$O((\log n)^\alpha M(n))$, and the overall time to compute $\arctan(x)$ is

$$O\left((\log n)^{\alpha+1} M(n)\right).$$

Indeed, we have $0 \le p_i < 2^{2^{i-1}}$, thus $p_i$ has at most $2^{i-1}$
bits, and $p_i/q_i$ as a rational has value $O(2^{-2^{i-1}})$ and size $O(2^i)$.
The exponent $\alpha + 1$ is 2 or 3. Although this is not asymptotically as fast
as AGM-based algorithms, the implicit constants for binary splitting are small
and the idea is useful for quite large $n$ (at least $10^6$ decimal places).
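The repeated reduction with $q_i = 2^{2^i}$ can be followed step by step with
exact rationals. The sketch below only performs the decomposition; each
$\arctan(p_i/q_i)$ would then be evaluated by binary splitting. The function
name `arctan_reduction` and the choice of truncation for $p_i$ are assumptions
of this example.

```python
import math
from fractions import Fraction

def arctan_reduction(x, levels):
    """Decompose arctan(x), 0 < x < 1/4, as
    sum_i arctan(p_i/q_i) + arctan(delta), with q_i = 2^(2^i).

    Returns (list of the rationals p_i/q_i, final delta).
    """
    x = Fraction(x)
    pieces = []
    for i in range(1, levels + 1):
        q = 2 ** (2 ** i)
        p = math.floor(x * q)              # p/q <= x < (p+1)/q
        pieces.append(Fraction(p, q))
        x = (q * x - p) / (q + p * x)      # delta = (qx - p)/(q + px)
    return pieces, x
```

After `levels` steps the remaining `delta` is below $2^{-2^{\text{levels}}}$,
so the number of correct bits roughly doubles with each level.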

Generalisations. The idea of binary splitting can be generalised. For
example, the Chudnovsky brothers gave a "bit-burst" algorithm which applies
to fast evaluation of solutions of linear differential equations. This is
described in §4.9.2.
4.9.1 A Binary Splitting Algorithm for sin, cos

In [36, Theorem 6.2], the first author claims an $O(M(n)\log^2 n)$ algorithm
for $\exp x$ and $\sin x$, however the proof only covers the case of the
exponential, and ends with "the proof of (6.28) is similar". The author had in
mind deducing $\sin x$ from a complex computation of
$\exp(ix) = \cos x + i\sin x$. Algorithm SinCos is a variation of Brent's
algorithm for $\exp x$ that computes $\sin x$ and $\cos x$ simultaneously, in a
way that avoids computations with complex numbers. The simultaneous computation
of $\sin x$ and $\cos x$ might be useful, for example, to compute $\tan x$ or a
plane rotation through the angle $x$.

Algorithm 53 SinCos

Input: floating-point $0 < x < 1/2$, integer $n$
Output: an approximation of $\sin x$ and $\cos x$ with error $O(2^{-n})$
1: Write $x \approx \sum_{i=0}^{k} p_i \cdot 2^{-2^{i+1}}$ where $0 \le p_i < 2^{2^i}$ and $k = \lceil \lg n \rceil - 1$
2: Let $x_j = \sum_{i=j}^{k} p_i \cdot 2^{-2^{i+1}}$, with $x_{k+1} = 0$, and $y_j = p_j \cdot 2^{-2^{j+1}}$
3: $(S_{k+1}, C_{k+1}) \leftarrow (0, 1)$    ($S_j$ is $\sin x_j$ and $C_j$ is $\cos x_j$)
4: for $j = k, k-1, \ldots, 1, 0$ do
5:     Compute $\sin y_j$ and $\cos y_j$ using binary splitting
6:     $S_j \leftarrow \sin y_j \cdot C_{j+1} + \cos y_j \cdot S_{j+1}$, $C_j \leftarrow \cos y_j \cdot C_{j+1} - \sin y_j \cdot S_{j+1}$
7: return $(S_0, C_0)$

At Step 2 of Algorithm SinCos, we have $x_j = y_j + x_{j+1}$, thus
$\sin x_j = \sin y_j\cos x_{j+1} + \cos y_j\sin x_{j+1}$, and similarly for
$\cos x_j$, explaining the formulae used at Step 6. Step 5 uses a binary
splitting algorithm similar to the one described above for $\arctan(p/q)$:
$y_j$ is a small rational, or is small itself, so that all needed powers do not
exceed $n$ bits in size. This algorithm has the same complexity
$O(M(n)\log^2 n)$ as Brent's algorithm for $\exp x$.
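The structure of Algorithm SinCos (decomposition of $x$ into chunks of
doubling length, followed by recombination with the addition formulae) can be
illustrated in double precision. In this sketch Step 5 is replaced by calls to
`math.sin` and `math.cos`, which stand in for the binary splitting evaluation;
that substitution, and the function name `sincos_chunks`, are assumptions of
this example.

```python
import math
from fractions import Fraction

def sincos_chunks(x, n):
    """Approximate (sin x, cos x) for 0 < x < 1/2 following Algorithm SinCos.

    Step 1 splits x into y_i = p_i * 2^(-2^(i+1)) with 0 <= p_i < 2^(2^i);
    Steps 4 to 6 recombine the chunks from the least significant upwards.
    """
    k = max(0, math.ceil(math.log2(n)) - 1)
    chunks = []
    rest = Fraction(x)                        # exact value of the float x
    for i in range(k + 1):
        scale = 2 ** (2 ** (i + 1))
        p = math.floor(rest * scale)          # next 2^i bits of x
        chunks.append(Fraction(p, scale))
        rest -= Fraction(p, scale)
    S, C = 0.0, 1.0                           # sin and cos of x_{k+1} = 0
    for y in reversed(chunks):
        sy, cy = math.sin(float(y)), math.cos(float(y))  # stand-in for Step 5
        S, C = sy * C + cy * S, cy * C - sy * S
    return S, C
```

In the real algorithm each chunk $y_j$ has a short numerator, which is what
makes its Taylor series suitable for binary splitting.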

4.9.2 The Bit-Burst Algorithm 

The binary-splitting algorithms described above for $\arctan x$, $\exp x$,
$\sin x$ rely on a functional equation:
$\tan(x+y) = (\tan x + \tan y)/(1 - \tan x\tan y)$,
$\exp(x+y) = \exp(x)\exp(y)$, $\sin(x+y) = \sin x\cos y + \sin y\cos x$. We
describe here a more general algorithm, known as the "bit-burst" algorithm,
which does not require such a functional equation. This algorithm applies to
the so-called "D-finite" functions.

A function $f(x)$ is said to be D-finite or holonomic iff it satisfies a
linear differential equation with polynomial coefficients in $x$. Equivalently,
the Taylor coefficients $u_k$ of $f$ satisfy a linear recurrence with polynomial
coefficients in $k$. For example, the exp, ln, sin, cos functions are D-finite,
but tan is not. An important subclass of D-finite functions is the
hypergeometric functions, whose Taylor coefficients satisfy a homogeneous
recurrence of order 1: $u_{k+1}/u_k = R(k)$ where $R(k)$ is a rational function
of $k$ (see §4.4). However, D-finite functions are much more general than
hypergeometric functions (see Ex. 4.45): in particular the ratio of two
consecutive terms in a hypergeometric series has size $O(\log k)$ (as a rational
number), but can be much larger for D-finite functions.

Theorem 4.9.1 If $f$ is D-finite and has no singularities on a finite, closed
interval $[A, B]$, where $A < 0 < B$ and $f(0) = 0$, then $f(x)$ can be
computed to an (absolute) accuracy of $n$ bits, for any $n$-bit floating-point
number $x \in [A, B]$, in time $O(M(n)\log^3 n)$.

Note: the condition $f(0) = 0$ is just a technical condition to simplify the
proof of the theorem; $f(0)$ can be any value that can be computed to $n$ bits
in time $O(M(n)\log n)$.

Proof. Without loss of generality, we assume $0 \le x < 1 < B$; the binary
expansion of $x$ can then be written $x = 0.b_1b_2\ldots b_n$. Define
$r_1 = 0.b_1$, $r_2 = 0.0b_2b_3$, $r_3 = 0.000b_4b_5b_6b_7$ (the same
decomposition was already used in Algorithm SinCos): $r_1$ consists of the
first bit of the binary expansion of $x$, $r_2$ consists of the next two bits,
$r_3$ the next four bits, and so on. We thus have
$x = r_1 + r_2 + \cdots + r_k$ where $2^{k-1} \le n < 2^k$.

Define $x_i = r_1 + \cdots + r_i$ with $x_0 = 0$. The idea of the algorithm is
to translate the Taylor series of $f$ from $x_i$ to $x_{i+1}$; since $f$ is
D-finite, this reduces to translating the recurrence on the corresponding
coefficients. The condition that $f$ has no singularity in
$[0, x] \subset [A, B]$ ensures that the translated recurrence is well-defined.
We define $f_0(t) = f(t)$, $f_1(t) = f_0(r_1 + t)$, $f_2(t) = f_1(r_2 + t)$,
$\ldots$, $f_i(t) = f_{i-1}(r_i + t)$ for $i \le k$. We have
$f_i(t) = f(x_i + t)$, and $f_k(t) = f(x + t)$ since $x_k = x$. Thus we are
looking for $f_k(0) = f(x)$.

Let $f_i^*(t) = f_i(t) - f_i(0)$ be the non-constant part of the Taylor
expansion of $f_i$. We have
$f_i^*(r_{i+1}) = f_i(r_{i+1}) - f_i(0) = f_{i+1}(0) - f_i(0)$ since
$f_{i+1}(t) = f_i(r_{i+1} + t)$. Thus
$f_0^*(r_1) + \cdots + f_{k-1}^*(r_k) = (f_1(0) - f_0(0)) + \cdots + (f_k(0) - f_{k-1}(0)) = f_k(0) - f_0(0) = f(x) - f(0)$.
Since $f(0) = 0$, this gives:

\[ f(x) = \sum_{i=0}^{k-1} f_i^*(r_{i+1}). \]



To conclude the proof, we will show that each term $f_i^*(r_{i+1})$ can be
evaluated to $n$ bits in time $O(M(n) \log^2 n)$. The rational $r_{i+1}$ has a
numerator of at most $2^i$ bits, and

\[ 0 \le r_{i+1} < 2^{1-2^i}. \]

Thus, to evaluate $f_i^*(r_{i+1})$ to $n$ bits, $n/2^i + O(\log n)$ terms of
the Taylor expansion of $f_i^*(t)$ are enough. We now use the fact that $f$ is
D-finite. Assume $f$ satisfies the following homogeneous linear differential
equation with polynomial coefficients (see footnote 1):

\[ c_m(t) f^{(m)}(t) + \cdots + c_1(t) f'(t) + c_0(t) f(t) = 0. \]

Substituting $x_i + t$ for $t$, we obtain a differential equation for $f_i$:

\[ c_m(x_i + t) f_i^{(m)}(t) + \cdots + c_1(x_i + t) f_i'(t) + c_0(x_i + t) f_i(t) = 0. \]

From this equation we deduce (see the Notes and References) a linear
recurrence for the Taylor coefficients of $f_i(t)$, of the same order as that
for $f(t)$. The coefficients in the recurrence for $f_i(t)$ have $O(2^i)$
bits, since $x_i = r_1 + \cdots + r_i$ has $O(2^i)$ bits. It follows that the
$\ell$-th Taylor coefficient of $f_i(t)$ has size $O(\ell(2^i + \log \ell))$.
The $\ell \log \ell$ term comes from the polynomials in $\ell$ in the
recurrence. Since $\ell \le n/2^i + O(\log n)$, this is $O(n \log n)$.

However, we do not want to evaluate the $\ell$-th Taylor coefficient $u_\ell$
of $f_i(t)$, but the series

\[ s_\ell = \sum_{j=1}^{\ell} u_j r_{i+1}^j. \]

Noting that $u_\ell = (s_\ell - s_{\ell-1})/r_{i+1}^\ell$, and substituting
this value in the recurrence for $(u_\ell)$, say of order $d$, we obtain a
recurrence of order $d + 1$ for $(s_\ell)$. Putting this latter recurrence in
matrix form $S_\ell = M_\ell S_{\ell-1}$, where $S_\ell$ is the vector
$(s_\ell, s_{\ell-1}, \ldots, s_{\ell-d})$, we obtain

\[ S_\ell = M_\ell M_{\ell-1} \cdots M_{d+1} S_d, \tag{4.81} \]

where the matrix product $M_\ell M_{\ell-1} \cdots M_{d+1}$ can be evaluated in
$O(M(n) \log^2 n)$ using binary splitting. □

Footnote 1: If $f$ satisfies a non-homogeneous differential equation, say
$E(t, f(t), f'(t), \ldots) = b(t)$, where $b(t)$ is polynomial in $t$, then
differentiating $\deg(b) + 1$ times yields a homogeneous equation.

We illustrate Theorem 4.9.1 with the arc-tangent function, which satisfies
the differential equation $f'(t)(1 + t^2) = 1$. This equation evaluates at
$x_i + t$ to $f_i'(t)(1 + (x_i + t)^2) = 1$, where $f_i(t) = f(x_i + t)$. This
gives the recurrence

\[ (1 + x_i^2)\,\ell\,u_\ell + 2x_i(\ell - 1)u_{\ell-1} + (\ell - 2)u_{\ell-2} = 0 \]

for the Taylor coefficients $u_\ell$ of $f_i$. This recurrence translates to

\[ (1 + x_i^2)\,\ell\,v_\ell + 2x_i r_{i+1}(\ell - 1)v_{\ell-1} + r_{i+1}^2(\ell - 2)v_{\ell-2} = 0 \]

for $v_\ell = u_\ell r_{i+1}^\ell$, and to

\[ (1 + x_i^2)\,\ell\,(s_\ell - s_{\ell-1}) + 2x_i r_{i+1}(\ell - 1)(s_{\ell-1} - s_{\ell-2}) + r_{i+1}^2(\ell - 2)(s_{\ell-2} - s_{\ell-3}) = 0 \]

for $s_\ell = \sum_{j=1}^{\ell} v_j$. This recurrence of order 3 can be written
in matrix form, and Eq. (4.81) enables one to efficiently compute
$s_\ell \approx f_i(r_{i+1}) - f_i(0)$ using multiplication of $3 \times 3$
matrices and fast integer multiplication.
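The order-3 recurrence on the partial sums can be checked numerically with a short Python sketch. It is only an illustration under stated assumptions: it iterates the recurrence term by term in exact rational arithmetic instead of forming the matrix product by binary splitting, it starts from $s_1 = v_1 = r_{i+1}/(1 + x_i^2)$ (the recurrence itself is used for $\ell \ge 2$), and the name `arctan_increment` and the sample values are hypothetical.

```python
from fractions import Fraction
import math


def arctan_increment(x_i, r, terms):
    """Approximate atan(x_i + r) - atan(x_i) by the partial sum s_terms,
    computed from the order-3 recurrence on (s_l)."""
    c = 1 + x_i * x_i
    s3 = s2 = s1 = Fraction(0)          # s_{l-3}, s_{l-2}, s_{l-1}
    for l in range(1, terms + 1):
        if l == 1:
            s = r / c                   # s_1 = v_1, from (1 + x_i^2) u_1 = 1
        else:
            # Solve the recurrence for s_l in terms of s_{l-1}, s_{l-2}, s_{l-3}.
            s = s1 - (2 * x_i * r * (l - 1) * (s1 - s2)
                      + r * r * (l - 2) * (s2 - s3)) / (c * l)
        s3, s2, s1 = s2, s1, s
    return s1


x_i, r = Fraction(1, 2), Fraction(1, 8)
approx = arctan_increment(x_i, r, 30)
exact = math.atan(float(x_i + r)) - math.atan(float(x_i))
```

With these values `float(approx)` agrees with `exact` to double precision, since the terms decrease roughly like $(1/8)^\ell$.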

4.10 Contour Integration 

In this section we assume that facilities for arbitrary-precision complex
arithmetic are available. These can be built on top of an arbitrary-precision
real arithmetic package (see Chapters 3 and 5).

Let $f(z)$ be holomorphic in the disc $|z| < R$, $R > 1$, and let the power
series for $f$ be

\[ f(z) = \sum_{j=0}^{\infty} a_j z^j. \tag{4.82} \]

From Cauchy's theorem jHZl Ch. 7] we have

\[ a_j = \frac{1}{2\pi i} \int_C \frac{f(z)}{z^{j+1}}\,dz, \tag{4.83} \]

where $C$ is the unit circle. The contour integral in (4.83) may be
approximated numerically by sums

\[ S_{j,k} = \frac{1}{k} \sum_{m=0}^{k-1} f(e^{2\pi i m/k})\, e^{-2\pi i j m/k}. \tag{4.84} \]

Let $C'$ be a circle with centre at the origin and radius $\rho \in (1, R)$.
From Cauchy's theorem, assuming that $j < k$, we have (see Exercise 4.46):

\[ S_{j,k} - a_j = \frac{1}{2\pi i} \int_{C'} \frac{f(z)}{(z^k - 1) z^{j+1}}\,dz = a_{j+k} + a_{j+2k} + \cdots, \tag{4.85} \]

so $|S_{j,k} - a_j| = O((R - \delta)^{-(j+k)})$ as $k \to \infty$, for any
$\delta > 0$. For example, let

\[ f(z) = \frac{z}{e^z - 1} + \frac{z}{2} \tag{4.86} \]

be the generating function for the scaled Bernoulli numbers as in (4.58), so
$a_{2j} = C_j = B_{2j}/(2j)!$ and $R = 2\pi$ (because of the poles at
$\pm 2\pi i$). Then

\[ S_{2j,k} - \frac{B_{2j}}{(2j)!} = \frac{B_{2j+k}}{(2j+k)!} + \frac{B_{2j+2k}}{(2j+2k)!} + \cdots, \tag{4.87} \]

so we can evaluate $B_{2j}$ with relative error $O((2\pi)^{-k})$ by evaluating
$f(z)$ at $k$ points on the unit circle.
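The sums (4.84) are easy to try out in ordinary double precision. The Python sketch below is an illustration only: it takes $f(z) = z/(e^z - 1) + z/2$, whose even Taylor coefficients are $B_{2j}/(2j)!$, and uses hardware complex arithmetic, so it shows the rapid convergence in $k$ but not the arbitrary-precision setting of this section. The names `f` and `contour_sum` are assumptions of the example.

```python
import cmath
import math


def f(z):
    # Generating function of the scaled Bernoulli numbers B_{2j}/(2j)!.
    return z / (cmath.exp(z) - 1) + z / 2


def contour_sum(j, k):
    """S_{j,k} = (1/k) * sum_{m=0}^{k-1} f(e^{2 pi i m/k}) e^{-2 pi i j m/k}."""
    total = 0
    for m in range(k):
        theta = 2 * math.pi * m / k
        total += f(cmath.exp(1j * theta)) * cmath.exp(-1j * j * theta)
    return total / k


# Recover B_2, B_4, B_6 from k = 32 sample points on the unit circle.
b2 = (contour_sum(2, 32) * math.factorial(2)).real
b4 = (contour_sum(4, 32) * math.factorial(4)).real
b6 = (contour_sum(6, 32) * math.factorial(6)).real
```

With $k = 32$ the aliasing error is far below double-precision rounding, so `b2`, `b4`, `b6` agree with $1/6$, $-1/30$, $1/42$ up to rounding error.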

There is some cancellation when using (4.84) to evaluate $S_{2j,k}$ because
the terms in the sum are of order unity but the result is of order
$(2\pi)^{-2j}$. Thus $O(j)$ guard digits are needed. In the following we
assume $j = O(n)$.

If $\exp(-2\pi i j m/k)$ is computed efficiently from $\exp(-2\pi i/k)$ in the
obvious way, the time required to evaluate $B_2, \ldots, B_{2j}$ to precision
$n$ is $O(jnM(n))$, and the space required is $O(n)$. We assume here that we
need all Bernoulli numbers up to index $2j$, but we do not need to store all
of them simultaneously. This is the case if we are using the Bernoulli
numbers as coefficients in a sum such as (4.39).

The recurrence relation method of §4.7.2 is faster but requires space
$\Theta(jn)$. Thus, the method of contour integration has advantages if space
is critical.

For comments on other forms of numerical quadrature, see the Notes and 
References. 



4.11 Exercises 

Exercise 4.1 If $A(x) = \sum_{j \ge 0} a_j x^j$ is a formal power series over
$\mathbb{R}$ with $a_0 = 1$, show that $\ln(A(x))$ can be computed with error
$O(x^n)$ in time $O(M(n))$, where $M(n)$ is the time required to multiply two
polynomials of degree $n - 1$. Assume a reasonable smoothness condition on the
growth of $M(n)$ as a function of $n$. [Hint: $(d/dx)\ln(A(x)) = A'(x)/A(x)$.]
Does a similar result hold for $n$-bit numbers if $x$ is replaced by $1/2$?

Exercise 4.2 (Schost) Assume one wants to compute $1/s(x) \bmod x^n$, for
$s(x)$ a power series. Design an algorithm using an odd-even scheme (§1.3.5),
and estimate its complexity in the FFT range.

Exercise 4.3 Suppose that $g$ and $h$ are sufficiently smooth functions
satisfying $g(h(x)) = x$ on some interval. Let $y_j = h(x_j)$. Show that the
iteration

\[ x_{j+1} = x_j + \sum_{m=1}^{k-1} (y - y_j)^m \frac{g^{(m)}(y_j)}{m!} \]

is a $k$-th order iteration that (under suitable conditions) will converge to
$x = g(y)$. [Hint: generalise the argument leading to (4.16).]

Exercise 4.4 Design a Horner-like algorithm for evaluating a series
$\sum_{j \ge 0} a_j x^j$ in the forward direction, while deciding dynamically
where to stop.

Exercise 4.5 Assume one wants $n$ bits of $\exp x$ for $x$ of order $2^j$, with
the repeated use of the doubling formula (§4.3.1), and the naive method to
evaluate power series. What is the best reduced argument $x/2^k$ in terms of
$n$ and $j$? [Consider both cases $j \ge 0$ and $j < 0$.]

Exercise 4.6 Assuming one can compute an $n$-bit approximation to $\ln x$ in
time $T(n)$, where $n \ll M(n) = o(T(n))$, show how to compute an $n$-bit
approximation to $\exp x$ in time $\sim T(n)$. Assume that $T(n)$ and $M(n)$
satisfy reasonable smoothness conditions.

Exercise 4.7 Care has to be taken to use enough guard digits when computing
$\exp(x)$ by argument reduction followed by the power series (4.21). If $x$ is
of order unity and $k$ steps of argument reduction are used to compute
$\exp(x)$ via

\[ \exp(x) = \left(\exp(x/2^k)\right)^{2^k}, \]

show that about $k$ bits of precision will be lost (so it is necessary to use
about $k$ guard bits).

Exercise 4.8 Show that the problem analysed in Ex. 4.7 can be avoided if we
work with the function

\[ \mathrm{expm1}(x) = \exp(x) - 1 = \sum_{j=1}^{\infty} \frac{x^j}{j!}, \]

which satisfies the doubling formula

\[ \mathrm{expm1}(2x) = \mathrm{expm1}(x)\,(2 + \mathrm{expm1}(x)). \]

Exercise 4.9 For $x > -1$, prove the reduction formula

\[ \mathrm{log1p}(x) = 2\,\mathrm{log1p}\!\left(\frac{x}{1 + \sqrt{1 + x}}\right), \]

where the function $\mathrm{log1p}(x)$ is defined, as in §4.4.2, by

\[ \mathrm{log1p}(x) = \ln(1 + x). \]

Explain why it might be desirable to work with log1p instead of ln in order to
avoid loss of precision (in the argument reduction, rather than in the
reconstruction as in Ex. 4.7). Note however that argument reduction for log1p
is more expensive than that for expm1, because of the square root.

Exercise 4.10 Give a numerically stable way of computing $\sinh(x)$ using one
evaluation of $\mathrm{expm1}(|x|)$ and a small number of additional
operations (compare Eq. CEP).

Exercise 4.11 (White) Show that $\exp(x)$ can be computed via $\sinh(x)$ using
the formula

\[ \exp(x) = \sinh(x) + \sqrt{1 + \sinh^2(x)}. \]

Since

\[ \sinh(x) = \frac{e^x - e^{-x}}{2} = \sum_{k \ge 0} \frac{x^{2k+1}}{(2k+1)!}, \]

this saves computing about half the terms in the power series for $\exp(x)$ at
the expense of one square root. How would you modify this method to preserve
numerical stability for negative arguments $x$?

Exercise 4.12 Count precisely the number of nonscalar products necessary for
the two variants of rectangular series splitting (§4.4.3).

Exercise 4.13 A drawback of rectangular series splitting as presented in
§4.4.3 is that the coefficients ($a_{k\ell+m}$ in the classical splitting, or
$a_{jm+\ell}$ in the modular splitting) involved in the scalar multiplications
might become large. Indeed, they are typically a product of factorials, and
thus have size $O(d \log d)$. Assuming that the ratios $a_{j+1}/a_j$ are small
rationals, propose an alternate way of evaluating $P(x)$.

Exercise 4.14 Make explicit the cost of the slowly growing function $c(d)$ (§4.4.3).

Exercise 4.15 Prove the remainder term (4.28) in the expansion (4.27) for
$E_1(x)$. [Hint: prove the result by induction on $k$, using integration by
parts in the formula (OHl.l



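Exercise 4.16 Show that we can avoid using Cauchy principal value integrals by
defining $\mathrm{Ei}(z)$ and $E_1(z)$ in terms of the entire function

\[ \mathrm{Ein}(z) = \int_0^z \frac{1 - \exp(-t)}{t}\,dt = \sum_{j=1}^{\infty} \frac{(-1)^{j-1} z^j}{j!\,j}. \]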

Exercise 4.17 Let $E_1(x)$ be defined by (4.25) for real $x > 0$. Using (4.27),
show that

\[ \frac{1}{x} - \frac{1}{x^2} < e^x E_1(x) < \frac{1}{x}. \]

Exercise 4.18 In this exercise the series are purely formal, so ignore any
questions of convergence. Applications are given in Exercises 4.19-4.20.

Suppose that $(a_j)_{j \in \mathbb{N}}$ is a sequence with exponential
generating function $s(z) = \sum_{j=0}^{\infty} a_j z^j / j!$. Suppose that
$A_n = \sum_{j=0}^{n} \binom{n}{j} a_j$, and let
$S(z) = \sum_{j=0}^{\infty} A_j z^j / j!$ be the exponential generating
function of the sequence $(A_n)_{n \in \mathbb{N}}$. Show that

\[ S(z) = \exp(z)\,s(z). \]

Exercise 4.19 The power series for $\mathrm{Ein}(z)$ given in Exercise 4.16
suffers from catastrophic cancellation when $z$ is large and positive (like the
series for $\exp(-z)$). Use Exercise 4.18 to show that this problem can be
avoided by using the power series (where $H_n$ denotes the $n$-th harmonic
number)

\[ e^z\,\mathrm{Ein}(z) = \sum_{j=1}^{\infty} \frac{H_j z^j}{j!}. \]





Exercise 4.20 Show that the formula (4.23) for $\mathrm{erf}(x)$ follows from
formula (4.22). [Hint: This is similar to Exercise 4.19.]

Exercise 4.21 Give an algorithm to evaluate $\Gamma(x)$ for real $x > 1/2$,
with guaranteed relative error $O(2^{-n})$. Use the method sketched in §4.5 for
$\ln \Gamma(x)$. What can you say about the complexity of the algorithm?

Exercise 4.22 Extend your solution to Exercise 4.21 to give an algorithm to
evaluate $1/\Gamma(z)$ for $z \in \mathbb{C}$, with guaranteed relative error
$O(2^{-n})$. Note: $\Gamma(z)$ has poles at zero and the negative integers
(that is, for $-z \in \mathbb{N}$), but we overcome this difficulty by
computing the entire function $1/\Gamma(z)$. Warning: $|\Gamma(z)|$ can be very
small if $\Im(z)$ is large. This follows from Stirling's asymptotic expansion.
In the particular case of $z = iy$ on the imaginary axis we have

\[ 2 \ln |\Gamma(iy)| = \ln\left(\frac{\pi}{y \sinh(\pi y)}\right) \approx -\pi |y|. \]

More generally,

\[ |\Gamma(x + iy)|^2 \approx 2\pi |y|^{2x-1} \exp(-\pi |y|) \]

for $x, y \in \mathbb{R}$ and $|y|$ large.

Exercise 4.23 The usual form (4.39) of Stirling's approximation for
$\ln(\Gamma(z))$ involves a divergent series. It is possible to give a version
of Stirling's approximation where the series is convergent:

\[ \ln \Gamma(z) = \left(z - \frac{1}{2}\right) \ln z - z + \frac{\ln(2\pi)}{2} + \sum_{k=1}^{\infty} \frac{c_k}{(z+1)(z+2) \cdots (z+k)}, \tag{4.88} \]

where the constants $c_k$ can be expressed in terms of Stirling numbers of the
first kind, $s(n, k)$, defined by the generating function

\[ \sum_{k=0}^{n} s(n, k) x^k = x(x - 1) \cdots (x - n + 1). \]

In fact

\[ c_k = \frac{1}{2k} \sum_{j=1}^{k} \frac{j\,|s(k, j)|}{(j + 1)(j + 2)}. \]

The Stirling numbers $|s(n, k)|$ can be computed easily from a three-term
recurrence, so this gives a feasible alternative to the usual form of
Stirling's approximation with coefficients related to Bernoulli numbers.

Show, experimentally and/or theoretically, that the convergent form of
Stirling's approximation is not an improvement over the usual form as used in
Exercise [OT]

Exercise 4.24 Implement procedures to evaluate $E_1(x)$ to high precision for
real positive $x$, using (a) the power series (4.26), (b) the asymptotic
expansion (4.27) (if sufficiently accurate), (c) the method of Exercise 4.19,
and (d) the continued fraction (4.40) using the backward and forward
recurrences. Determine empirically the regions where each method is the
fastest.

Exercise 4.25 Prove the backward recurrence (4.44).

Exercise 4.26 Prove the forward recurrence (4.45).
[Hint: Let

\[ y_k(x) = \frac{a_1}{b_1 +}\,\cdots\,\frac{a_{k-1}}{b_{k-1} +}\,\frac{a_k}{b_k + x}. \]

Show, by induction on $k \ge 1$, that

\[ y_k(x) = \frac{P_k + P_{k-1} x}{Q_k + Q_{k-1} x}. \]
]

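Exercise 4.27 For the forward recurrence (4.45), show that

\[ \begin{pmatrix} Q_k & Q_{k-1} \\ P_k & P_{k-1} \end{pmatrix} = \begin{pmatrix} b_1 & 1 \\ a_1 & 0 \end{pmatrix} \begin{pmatrix} b_2 & 1 \\ a_2 & 0 \end{pmatrix} \cdots \begin{pmatrix} b_k & 1 \\ a_k & 0 \end{pmatrix} \]

holds for $k > 0$ (and for $k = 0$ if we define $P_{-1}$, $Q_{-1}$
appropriately).

Remark. This gives a way to use parallelism when evaluating continued fractions.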

Exercise 4.28 For the forward recurrence (4.45), show that

\[ \begin{vmatrix} Q_k & Q_{k-1} \\ P_k & P_{k-1} \end{vmatrix} = (-1)^k a_1 a_2 \cdots a_k. \]

Exercise 4.29 Prove the identity (4.47).

Exercise 4.30 Prove Theorem 4.6.1.

Exercise 4.31 Investigate using the continued fraction (4.41) for evaluating the
complementary error function erfc(x) or the error function erf(x) = 1 — erfc(x). 
Is there a region where the continued fraction is preferable to any of the methods 
used in Algorithm Erf of 84 .' 



Exercise 4.32 Show that the continued fraction (4.42) can be evaluated in time
$O(M(k) \log k)$ if the $a_j$ and $b_j$ are bounded integers (or rational
numbers with bounded numerators and denominators).
[Hint: Use Exercise 4.27.]

Exercise 4.33 Instead of (4.55), a different normalisation condition

\[ J_0(x)^2 + 2 \sum_{\nu=1}^{\infty} J_\nu(x)^2 = 1 \tag{4.89} \]

could be used in Miller's algorithm. Which of these normalisation conditions is
preferable?

Exercise 4.34 Consider the recurrence $f_{\nu-1} + f_{\nu+1} = 2K f_\nu$, where
$K > 0$ is a fixed real constant. We can expect the solution to this
recurrence to give some insight into the behaviour of the recurrence (4.54) in
the region $\nu \approx Kx$. Assume for simplicity that $K \ne 1$. Show that
the general solution has the form

\[ f_\nu = A\lambda^\nu + B\mu^\nu, \]

where $\lambda$ and $\mu$ are the roots of the quadratic equation
$x^2 - 2Kx + 1 = 0$, and $A$ and $B$ are constants determined by the initial
conditions. Show that there are two cases: if $K < 1$ then $\lambda$ and $\mu$
are complex conjugates on the unit circle, so $|\lambda| = |\mu| = 1$; if
$K > 1$ then there are two real roots satisfying $\lambda\mu = 1$.

Exercise 4.35 Prove (or give a plausibility argument for) the statements made
in §4.7 that: (a) if a recurrence based on (4.60) is used to evaluate the
scaled Bernoulli number $C_k$, using precision $n$ arithmetic, then the
relative error is of order $4^k 2^{-n}$; and (b) if a recurrence based on
(4.61) is used, then the relative error is $O(k^2 2^{-n})$.

Exercise 4.36 Starting from the definition (4.57), prove Eq. (4.58). Deduce the
relation (4.63) connecting tangent numbers and Bernoulli numbers.

Exercise 4.37 (a) Show that the number of bits required to represent the
tangent number $T_k$ exactly is $\sim 2k \lg k$ as $k \to \infty$. (b) Show
that the same applies for the exact representation of the Bernoulli number
$B_{2k}$ as a rational number.

Exercise 4.38 Explain how the correctness of Algorithm Tangent-numbers
(§4.7.2) follows from the recurrence (4.64).



Exercise 4.39 Show that the complexity of computing the tangent numbers
$T_1, \ldots, T_m$ by Algorithm Tangent-numbers (§4.7.2) is $O(m^3 \log m)$.
Assume that the multiplications of tangent numbers $T_j$ by small integers take
time $O(\log T_j)$.
[Hint: Use the result of Exercise 4.37.]

Algorithm 54 Secant-numbers
Input: positive integer $m$
Output: Secant numbers $S_0, S_1, \ldots, S_m$
$S_0 \leftarrow 1$
for $k$ from 1 to $m$ do
    $S_k \leftarrow k * S_{k-1}$
for $k$ from 1 to $m$ do
    for $j$ from $k + 1$ to $m$ do
        $S_j \leftarrow (j - k) * S_{j-1} + (j - k + 1) * S_j$
Return $S_0, S_1, \ldots, S_m$
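A direct Python transcription of Algorithm Secant-numbers may help when experimenting with it. This is a sketch rather than a reference implementation: the function name `secant_numbers` is an assumption, the list is initialised with $S_0 = 1$, and Python's built-in integers stand in for arbitrary-precision arithmetic.

```python
def secant_numbers(m):
    """Return [S_0, S_1, ..., S_m], computed in place as in Algorithm
    Secant-numbers."""
    s = [1] * (m + 1)                    # S_0 = 1
    for k in range(1, m + 1):
        s[k] = k * s[k - 1]              # S_k starts as k!
    for k in range(1, m + 1):
        for j in range(k + 1, m + 1):
            s[j] = (j - k) * s[j - 1] + (j - k + 1) * s[j]
    return s


first_secant_numbers = secant_numbers(5)
```

For $m = 5$ this yields 1, 1, 5, 61, 1385, 50521, which match the coefficients of $x^{2k}/(2k)!$ in the expansion of $1/\cos x$.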



Exercise 4.40 Verify that Algorithm Secant-numbers computes in-place the
Secant numbers $S_k$, defined by the generating function

\[ \sum_{k \ge 0} S_k \frac{x^{2k}}{(2k)!} = \frac{1}{\cos x}, \]

in much the same way that Algorithm Tangent-numbers (§4.7.2) computes the
Tangent numbers.

Algorithm 55 Exponential-of-series
Input: positive integer $m$ and real numbers $a_1, a_2, \ldots, a_m$
Output: real numbers $b_0, b_1, \ldots, b_m$ such that
    $b_0 + b_1 x + \cdots + b_m x^m = \exp(a_1 x + \cdots + a_m x^m) + O(x^{m+1})$
$b_0 \leftarrow 1$
for $k$ from 1 to $m$ do
    $b_k \leftarrow \left(\sum_{j=1}^{k} j a_j b_{k-j}\right) / k$
Return $b_0, b_1, \ldots, b_m$
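The following Python sketch mirrors Algorithm Exponential-of-series using exact rationals, so that the output can be compared with known series. The function name `exponential_of_series` and the convention that the input list holds $a_1, \ldots, a_m$ (without a constant term) are assumptions of the example.

```python
from fractions import Fraction


def exponential_of_series(a):
    """Given a = [a_1, ..., a_m], return [b_0, ..., b_m] with
    b_0 + ... + b_m x^m = exp(a_1 x + ... + a_m x^m) + O(x^(m+1))."""
    m = len(a)
    b = [Fraction(1)] + [Fraction(0)] * m
    for k in range(1, m + 1):
        # k * b_k = sum_{j=1}^{k} j * a_j * b_{k-j}; a[j - 1] holds a_j.
        b[k] = sum(j * a[j - 1] * b[k - j] for j in range(1, k + 1)) / k
    return b


# exp(x): the coefficients are 1/k!.
exp_coeffs = exponential_of_series([1, 0, 0, 0])
# exp(-ln(1 - x)) = 1/(1 - x): all coefficients equal 1.
geom_coeffs = exponential_of_series([Fraction(1, j) for j in range(1, 6)])
```

The loop uses $O(m^2)$ coefficient operations, which is adequate for small $m$.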



Exercise 4.41 (a) Show that Algorithm Exponential-of-series computes
$B(x) = \exp(A(x))$ up to terms of order $x^{m+1}$, where
$A(x) = a_1 x + a_2 x^2 + \cdots + a_m x^m$ is input data and
$B(x) = b_0 + b_1 x + \cdots + b_m x^m$ is the output.
[Hint: Compare Exercise 4.1.]

(b) Apply this to give an algorithm to compute the coefficients $b_k$ in
Stirling's approximation for $n!$ (or $\Gamma(n + 1)$):

\[ n! \sim \left(\frac{n}{e}\right)^n \sqrt{2\pi n} \sum_{k \ge 0} \frac{b_k}{n^k}. \]

[Hint: we know the coefficients in Stirling's approximation (4.39) for
$\ln \Gamma(z)$ in terms of Bernoulli numbers.]

(c) Is this likely to be useful for high-precision computation of $\Gamma(x)$
for real positive $x$?

Exercise 4.42 Deduce from Eq. (4.70) and (4.71) an expansion of $\ln(4/k)$ with
error term $O(k^4 \log(4/k))$. Use any means to figure out an effective bound
on the $O()$ term. Deduce an algorithm requiring $x > 2^{n/4}$ only to get $n$
bits of $\ln x$.

Exercise 4.43 Show how both $\pi$ and $\ln 2$ can be evaluated using Eq. (4.72).

Exercise 4.44 Improve the constants at the end of §4.8.5.

Exercise 4.45 (Salvy) Is the function $\exp(x) + x/(1 - x^2)$ D-finite?

Exercise 4.46 If $w = e^{2\pi i/k}$, show that

\[ \frac{1}{z^k - 1} = \frac{1}{k} \sum_{m=0}^{k-1} \frac{w^m}{z - w^m}. \]

Deduce that $S_{j,k}$, defined by equation (4.84), satisfies

\[ S_{j,k} = \frac{1}{2\pi i} \int_{C'} \frac{z^{k-j-1}}{z^k - 1} f(z)\,dz \]

for $j < k$, where the contour $C'$ is as in §4.10. Deduce Equation (4.85).
Remark. Equation (4.85) illustrates the phenomenon of aliasing: observations at
$k$ points can not distinguish between the Fourier coefficients $a_j$,
$a_{j+k}$, $a_{j+2k}$, etc.

Exercise 4.47 Show that the sum $S_{2j,k}$ of §4.10 can be computed with
(essentially) only about $k/4$ evaluations of $f$ if $k$ is even. Similarly,
show that about $k/2$ evaluations of $f$ suffice if $k$ is odd. On the other
hand, show that the error bound $O((2\pi)^{-k})$ following equation (4.87) can
be improved if $k$ is odd.



4.12 Notes and References 

One of the main references for special functions is the "Handbook of Mathematical
Functions" by Abramowitz & Stegun [Q, which gives many useful results but no
proofs. A comprehensive reference is Andrews et al. [3]. A more recent book is
that of Nico Temme [163]. A large part of the content of this chapter comes from
[39], and was implemented in the MP package [38].

Basic material on Newton's method may be found in many references, for
example Brent [32, Ch. 3] and Traub [165]. Some details on the use of Newton's
method in modern processors can be found in [22]. The idea of first computing
$y^{-1/2}$, then multiplying by $y$ to get $y^{1/2}$ (§4.2.3) was pushed
further by Karp and Markstein [107], who perform this at the penultimate
iteration, and modify the last iteration of Newton's method for $y^{-1/2}$ to
directly get $y^{1/2}$ (see §1.4.5 for an example of the Karp-Markstein trick
for division). For more on Newton's method for power series, we refer to
|3"ill4*5"ll4l?llll0| .

Some good references on error analysis of floating-point algorithms are the
books by Higham [92] and Muller [135]. Older references include Wilkinson's
classics [TT6HTT7J .

Regarding doubling versus tripling: in §4.3.4 we assumed that one
multiplication and one squaring were required to apply the tripling formula
(4.19). However, one might use the form
$\sinh(3x) = 3\sinh(x) + 4\sinh^3(x)$, which requires only one cubing.
Assuming a cubing costs 50% more than a squaring (in the FFT range), the ratio
would be $1.5 \log_3 2 \approx 0.946$. Thus, if a specialised cubing routine is
available, tripling may sometimes be slightly faster than doubling.

For an example of a detailed error analysis of an unrestricted algorithm, see [57].

The rectangular series splitting to evaluate a power series with $O(\sqrt{n})$
nonscalar multiplications (§4.4.3) was first published by Paterson and
Stockmeyer in 1973 [139]. It was then rediscovered in the context of
multiple-precision evaluation of elementary functions by Smith in 1991
[156, §8.7]. Smith gave it the name "concurrent series". Note that Smith
proposed modular splitting of the series, but classical splitting seems
slightly better. Smith noticed that the simultaneous use of this fast
technique and argument reduction yields $O(n^{1/3} M(n))$ algorithms. Earlier,
in 1960, Estrin had found a similar technique with $n/2$ nonscalar
multiplications, but $O(\log n)$ parallel complexity [75].

There are several variants of the Euler-Maclaurin sum formula, with and with- 
out bounds on the remainder. See for example Abramowitz and Stegun ^ Ch. 23], 
Apostol 0, and the references given on the relevant Wikipedia and Mathworld 
pages. 

Most of the asymptotic expansions that we have given in §4.5 may be found
in Abramowitz and Stegun [Q. For more background on asymptotic expansions
of special functions, see for example the books by de Bruijn [H?|], Olver [137]
and Wong [178]. We have omitted mention of many other useful asymptotic
expansions, for example all but a few of those for Bessel functions [173, 175].

Most of the continued fractions mentioned in §4.6 may be found in Abramowitz
and Stegun \$\. The classical theory is given in Wall's book [172]. Continued
fractions are used in the manner described in §4.6 in arbitrary-precision
packages such as MP J3H]. A good recent reference on various aspects of
continued fractions for the evaluation of special functions is the Handbook of
Continued Fractions for Special Functions [HE]. In particular, Chapter 7 of
this book contains a discussion of error bounds. Our Theorem 4.6.1 is a trivial
modification of [68, Theorem 7.5.1]. The asymptotically fast algorithm
suggested in Exercise 4.32 was given by Schönhage [148].

A proof of a generalisation of (4.55) is given in [3, §4.9]. Miller's algorithm
is due to J. C. P. Miller. It is described, for example, in §9.12,§19.28] and
[Ml §13.14]. An algorithm is given in [82].



A recurrence based on (4.61) was used to evaluate the scaled Bernoulli numbers
$C_k$ in the MP package following a suggestion of Christian Reinsch [39, §12].
Previously, the inferior recurrence (4.60) was widely used, for example in
[108] and in early versions of the MP package [38, §6.11]. The idea of using
tangent numbers is mentioned in [144, §6.5], where it is attributed to
B. F. Logan. Our in-place Algorithms Tangent-numbers and Secant-numbers may be
new (see Exercises 4.38-4.40). Kaneko [105] describes an algorithm of Akiyama
and Tanigawa for computing Bernoulli numbers in a manner similar to "Pascal's
triangle". However, it requires more arithmetic operations than our algorithm
Tangent-numbers. Also, the Akiyama-Tanigawa algorithm is only recommended for
exact rational arithmetic, since it is numerically unstable if implemented in
floating-point arithmetic. For more on Bernoulli, Tangent and Secant numbers,
and a connection with Stirling numbers, see Chen [53] and Sloane \TES\ A027641,
A000182, A000364].

The Von Staudt-Clausen theorem was proved independently by Karl von Staudt 
and Thomas Clausen in 1840. It can be found in many references (for example, 
Wikipedia). If just a single Bernoulli number of large index is required, then 
Harvey's modular algorithm [HB] can be recommended. 

Some references on the Arithmetic-Geometric Mean (AGM) are Brent
[34, 37, 42], Salamin [145], the Borweins' book [2B], Arndt and Haenel jZj. An
early reference, which includes some results that were rediscovered later, is
the fascinating report HAKMEM ^21. Bernstein ^Hl gives a survey of different
AGM algorithms for computing the logarithm. Eq. (4.71) is given in Borwein &
Borwein J2H1 (1.3.10)], and the bound (4.74) is given in [23 p. 11, ex.4(c)].
The AGM can be extended to complex starting values provided we take the correct
branch of the square root (the one with positive real part): see Borwein &
Borwein [28, pp. 15-16]. The use of the complex AGM is discussed in [73]. For
theta function identities, see [23 Chapter 2], and for a proof of (4.78), see
[23 §2.3].

The use of the exact formula (4.79) to compute $\ln x$ was first suggested by
Sasaki and Kanada (see [2H1 (7.2.5)], but beware the typo). See [HZ] for Landen
transformations, and [HI] for more efficient methods; note that the constants
given in those papers might be improved using faster square root algorithms
(Chapter^).

The constants from §4.8.5 are obtained as follows. We assume we are in the
FFT domain, and one Fourier transform costs $\frac{1}{3}M(n)$. The
$2.0833M(n)$ cost for real division is from [2^. The complex division uses the
"faster" algorithm from [34, Section 11] which computes
$\frac{t + iu}{v + iw}$ as $\frac{(t + iu)(v - iw)}{v^2 + w^2}$, with one
complex multiplication for $(t + iu)(v - iw)$, two squarings for $v^2$ and
$w^2$, one reciprocal, and two real multiplications; noting that $v^2 + w^2$
can be computed in $M(n)$ by sharing the backward transform, and using
Schönhage's $1.5M(n)$ algorithm for the reciprocal, we get a total cost of
$6.5M(n)$. The $1.8333M(n)$ cost for the real square root is from [TJ]. The
complex square root uses Friedland's algorithm [79]:
$\sqrt{x + iy} = w + iy/(2w)$ where $w = \sqrt{(|x| + \sqrt{x^2 + y^2})/2}$; as
for the complex division, $x^2 + y^2$ costs $M(n)$, then we compute its square
root in $1.8333M(n)$, and we use Bernstein's $2.5M(n)$ algorithm ^7] to compute
simultaneously $w^{1/2}$ and $w^{-1/2}$, which we multiply by $y$ in $M(n)$,
which gives a total of $6.333M(n)$. The cost of one AGM iteration is simply the
sum of the multiplication cost and of the square root cost, while the logarithm
via the AGM costs $2\lg(n)$ AGM iterations.
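The square-root formula quoted above can be checked in double precision with a few lines of Python. This sketch makes two assumptions that are not in the text: the formula $w + iy/(2w)$ is applied as written only when $x \ge 0$, and for $x < 0$ the usual completion $|y|/(2w) + i\,\mathrm{sign}(y)\,w$ is used so that the result has non-negative real part. The name `friedland_sqrt` is hypothetical, and the code says nothing about the $M(n)$ cost model.

```python
import cmath
import math


def friedland_sqrt(x, y):
    """Square root of x + iy with non-negative real part (x, y not both 0)."""
    w = math.sqrt((abs(x) + math.hypot(x, y)) / 2)
    if x >= 0:
        return complex(w, y / (2 * w))          # formula from the text
    # Assumed completion for x < 0: swap the roles of real and imaginary parts.
    return complex(abs(y) / (2 * w), math.copysign(w, y))


samples = [(3.0, 4.0), (-3.0, 4.0), (-3.0, -4.0), (2.0, -1.0), (0.0, 2.0)]
errors = [abs(friedland_sqrt(x, y) - cmath.sqrt(complex(x, y)))
          for x, y in samples]
```

Each entry of `errors` is at the level of rounding error, which is consistent with the identity $(w + iy/(2w))^2 = x + iy$ when $w^2 = (x + \sqrt{x^2 + y^2})/2$ and $x \ge 0$.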

There is some disagreement in the literature about "binary splitting" and the
"FEE method" of E. A. Karatsuba [106]. We choose the name "binary splitting"
because it is more descriptive, and let the reader call it the "FEE method" if
he/she prefers. Whatever its name, the idea is quite old, since in 1976 Brent
[36, Theorem 6.2] gave a binary splitting algorithm to compute $\exp x$ in time
$O(M(n)(\log n)^2)$. The CLN library implements several functions with binary
splitting jHH], and is thus quite efficient for precisions of a million bits or
more. The "bit-burst" algorithm was invented by David and Gregory Chudnovsky
jSU, and our Theorem 4.9.1 is based on their work.

For more about D-finite functions, see for example the Maple gfun package
[146], which allows one, among other things, to deduce the recurrence for the
Taylor coefficients of $f(x)$ from its differential equation.

There are several topics that are not covered in this Chapter, but might be
in a later version. We mention some references here. A useful resource is the
website [164].

The Riemann zeta function $\zeta(s)$ can be evaluated by the Euler-Maclaurin
expansion (4.35)-(4.37), or by Borwein's algorithm |27( I3flj, but neither of
these methods is efficient if $\Im(s)$ is large. On the critical line
$\Re(s) = 1/2$, the Riemann-Siegel formula |HJ is much faster and in practice
sufficiently accurate (though only an asymptotic expansion; the error seems to
be $O(\exp(-\pi t))$ where $t = \Im(s)$: see the review [HZ1). An error
analysis is given in [141]. The Riemann-Siegel coefficients may be defined by a
recurrence in terms of certain integers $\rho_n$ that can be defined using
Euler numbers (see Sloane's sequence A087617 1155^). Sloane calls this the
Gabcke sequence but Gabcke credits Lehmer [118] so perhaps it should be called
the Lehmer-Gabcke sequence. The sequence $(\rho_n)$ occurs naturally in the
asymptotic expansion of $\ln(\Gamma(1/4 + it/2))$. The non-obvious fact that
the $\rho_n$ are integers was proved by de Reyna |7H].

Borwein's algorithm for $\zeta(s)$ can be generalised to cover functions such
as the polylogarithm and the Hurwitz zeta function: see Vepstas [170].

In §4.10 we briefly discussed the numerical approximation of contour integrals,
but we omitted any discussion of other forms of numerical quadrature, for
example Romberg quadrature, the tanh rule, the tanh-sinh rule, etc. Some
references are |81l51 ITUlll33U162| . and [23 §7.4.3]. For further discussion
of the contour integration method, see |12()j . For Clenshaw-Curtis and
Gaussian quadrature, see (SilliniElIl- An example of the use of numerical
quadrature to evaluate $\Gamma(x)$ is [24, p. 188]. This is an interesting
alternative to the method based on Stirling's asymptotic expansion (§4.5).

We have not discussed the computation of specific mathematical constants such
as $\pi$, $\gamma$ (Euler's constant), etc. $\pi$ can be evaluated using
$\pi = 4\arctan(1)$ and a fast arctan computation (§4.9.2); or by the
Gauss-Legendre algorithm (also known as the Brent-Salamin algorithm), see
|3"4"ll3THl45| . This asymptotically fast algorithm is based on the
arithmetic-geometric mean and Legendre's relation (J47
There are several popular books on $\pi$: we mention Arndt and Haenel [2]. A
more advanced book is the one by the Borwein brothers [28].

The computation of $\gamma$ and its continued fraction is of interest because
it is not known whether $\gamma$ is rational (though this is unlikely). The
best algorithm for computing $\gamma$ appears to be the "Bessel function"
algorithm of Brent and McMillan [47], as modified by Papanikolaou and later
Gourdon [85] to incorporate binary splitting. A very useful source of
information on the evaluation of constants (including $\pi$, $e$, $\gamma$,
$\ln 2$, $\zeta(3)$) and certain functions (including $\Gamma(z)$ and
$\zeta(s)$) is Gourdon and Sebah's web site [85].



Chapter 5 

Appendix: Implementations 
and Pointers 



This chapter is a non-exhaustive list of software packages that the authors
have tried, together with some other useful pointers.

5.1 Software Tools 

5.1.1 CLN 

CLN (Class Library for Numbers, http://www.ginac.de/CLN/) is a library
for efficient computations with all kinds of numbers in arbitrary precision.
It was written by Bruno Haible, and is currently maintained by Richard
Kreckel. It is written in C++ and distributed under the GNU General Public
License (GPL). CLN provides some elementary and special functions, and
fast arithmetic on large numbers; in particular, it implements
Schönhage-Strassen multiplication and the binary splitting algorithm jHEj.
CLN can be configured to use GMP low-level mpn routines, which "is known to
be quite a boost for CLN's performance".

5.1.2 GNU MP (GMP) 

The GNU MP library is the main reference for arbitrary-precision arithmetic. 
It has been developed by Torbjörn Granlund and other contributors since 
1991 (the first public version, GMP 1.1, was released in September 1991). 
GNU MP (GMP for short) implements several of the algorithms described in 
this book. In particular, we recommend reading the chapter "Algorithms" of the 
GMP reference manual [84]. GMP is written in the C language, is released 
under the GNU Lesser General Public License (LGPL), and is available from 
gmplib.org. 

GMP's mpz class implements arbitrary-precision integers (corresponding 
to Chapter 1), while the mpf class implements arbitrary-precision 
floating-point numbers (corresponding to Chapter 3). The performance of GMP 
comes mostly from its low-level mpn class, which is well designed and highly 
optimized in assembly code for many architectures. As of version 4.3.0, 
mpz implements different multiplication algorithms (schoolbook, Karatsuba, 
Toom-Cook 3-way, Toom-Cook 4-way, and Schönhage-Strassen); its division 
routine implements Algorithm RecursiveDivRem (§1.4.3) and thus has 
non-optimal complexity O(M(n) log n) instead of O(M(n)), and so does its 
square root, which implements Algorithm SqrtRem, since it relies on 
division. It also implements unbalanced multiplication, with Toom-Cook 3,2 
and Toom-Cook 4,2 J2S1- GMP 4.3.0 implements neither elementary nor 
special functions (Chapter 4), nor does it provide modular arithmetic with 
an invariant divisor (Chapter 2); however, it contains a preliminary interface for 
Montgomery's REDC algorithm. 
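
As a reminder of what the second algorithm in the list above does, here is a minimal sketch of Karatsuba multiplication on Python integers. It is not GMP's implementation: the function name karatsuba and the threshold_bits parameter are assumptions of this sketch, and the base case simply falls back to Python's built-in multiplication.

```python
def karatsuba(x, y, threshold_bits=64):
    """Multiply non-negative integers x and y with Karatsuba's method (sketch)."""
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y                       # small operand: use built-in product
    k = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << k) - 1
    x1, x0 = x >> k, x & mask              # x = x1 * 2^k + x0
    y1, y0 = y >> k, y & mask              # y = y1 * 2^k + y0
    lo = karatsuba(x0, y0, threshold_bits)
    hi = karatsuba(x1, y1, threshold_bits)
    # One extra product replaces the two cross products x1*y0 and x0*y1.
    mid = karatsuba(x0 + x1, y0 + y1, threshold_bits) - lo - hi
    return (hi << (2 * k)) + (mid << k) + lo
```

Three recursive products on half-size operands replace the four of the schoolbook method, which gives the familiar O(n^1.585) bit complexity.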



5.1.3 MPFQ 

MPFQ is a software library developed by Pierrick Gaudry and Emmanuel 
Thomé for manipulation of finite fields. What makes MPFQ different from 
other modular arithmetic libraries is that the target finite field is given at 
compile time, thus more specific optimizations can be done. The two main 
targets of MPFQ are the Galois fields F_{2^n} and F_p with p prime. MPFQ 
is available from http://www.mpfq.org/, and distributed under the GNU 
Lesser General Public License (LGPL). 



5.1.4 MPFR 

MPFR is a multiple-precision binary floating-point library, written in portable 
C language, based on the GNU MP library, and distributed under the GNU 
Lesser General Public License (LGPL). It extends to arbitrary-precision 
arithmetic the main ideas of the IEEE 754 standard, by providing correct 
rounding and exceptions. MPFR implements the algorithms of Chapter 3 
and most of those of Chapter 4, including all mathematical functions 
defined by the ISO C99 standard. These strong semantics are in most cases 
achieved with no significant slowdown compared to other arbitrary-precision 
tools. For details of the MPFR library, see http://www.mpfr.org and the 
ACM TOMS paper. 
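
To make the notion of correct rounding concrete, the sketch below uses Python's standard decimal module as a stand-in. This is an assumption of the example: decimal works in radix 10, whereas MPFR is a binary library with its own C interface, which is not shown here. The helper name divide is hypothetical.

```python
from decimal import Decimal, localcontext
from decimal import ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_EVEN


def divide(x, y, precision, rounding):
    """Return x / y rounded to `precision` significant digits in the given mode."""
    with localcontext() as ctx:
        ctx.prec = precision
        ctx.rounding = rounding
        return Decimal(x) / Decimal(y)


down = divide(1, 3, 10, ROUND_FLOOR)          # largest 10-digit value <= 1/3
up = divide(1, 3, 10, ROUND_CEILING)          # smallest 10-digit value >= 1/3
nearest = divide(2, 3, 10, ROUND_HALF_EVEN)   # 10-digit value closest to 2/3
```

The result depends only on the exact quotient, the target precision and the rounding mode, which is the property that correct rounding guarantees.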



5.2 Mailing Lists 

5.2.1 The BNIS Mailing List 

The BNIS mailing list was created by Dan Bernstein for "Anything of interest 
to implementors of large-integer arithmetic packages". It has low traffic 
(a few messages per year only). See http://cr.yp.to/lists.html to subscribe. 
An archive of this list is available at 



http://www.nabble.com/cr.yp.to---bnis-f846.html 



5.2.2 The GMP Lists 

There are four mailing lists associated with GMP: gmp-bugs for bug reports; 
gmp-announce for important announcements about GMP, in particular new 
releases; gmp-discuss for general discussions about GMP; gmp-devel for 
technical discussions between GMP developers. We recommend subscribing 
to gmp-announce (very low traffic), to gmp-discuss (medium to high traffic), 
and to gmp-devel only if you are interested in the internals of GMP. 



5.3 On-Line Documents 

The Wolfram Functions Site (http://functions.wolfram.com/) contains a 
lot of information about mathematical functions (definition, specific values, 
general characteristics, representations as series, limits, integrals, continued 
fractions, differential equations, transformations, and so on). 



The Encyclopedia of Special Functions (http://algo.inria.fr/esf/) is 
another nice web site, whose originality is that all formulae are automatically 
generated from very few data that uniquely define the corresponding function 
in a general class [126]. 

A huge amount of information about interval arithmetic can be found on 
the Interval Computations page (http://www.cs.utep.edu/interval-comp/) 
(introduction, software, languages, books, courses, information about the 
interval arithmetic community, applications). 

Mike Cowlishaw maintains an extensive bibliography of conversion to and 
from decimal arithmetic at http://speleotrove.com/decimal/ 



Summary of Complexities 

Integer Arithmetic (n-bit input unless said otherwise) 

Addition, Subtraction                   O(n) 
Multiplication                          M(n) 
Unbalanced Multiplication               M(m, n) = min(⌈m/n⌉ M(n), M((m + n)/2)) 
Division                                O(M(n)) 
Unbalanced Division (with remainder)    D(m + n, n) = O(M(m, n)) 
Square Root                             O(M(n)) 
k-th Root (with remainder)              O(M(n)) 
GCD, extended GCD                       O(M(n) log n) 
Base Conversion                         O(M(n) log n) 



Modular Arithmetic (n-bit modulus) 

Addition, Subtraction                   O(n) 
Multiplication                          M(n) 
Division, Inversion                     O(M(n) log n) 
Exponentiation (k-bit exponent)         O(k M(n)) 
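
The O(k M(n)) bound for exponentiation with a k-bit exponent can be read off from binary exponentiation, sketched below under the assumption that each modular multiplication costs O(M(n)). The function name mod_pow is hypothetical, and Python's built-in integers stand in for a multiple-precision library.

```python
def mod_pow(base, exponent, modulus):
    """Left-to-right binary exponentiation: base**exponent mod modulus (sketch)."""
    if modulus <= 0 or exponent < 0:
        raise ValueError("modulus must be positive and exponent non-negative")
    result = 1 % modulus
    base %= modulus
    # One squaring per exponent bit, plus one multiplication per set bit:
    # at most 2k modular multiplications for a k-bit exponent.
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus
        if bit == "1":
            result = (result * base) % modulus
    return result
```

For example, mod_pow(3, 200, 1000003) scans the eight bits of 200 and agrees with Python's built-in pow(3, 200, 1000003).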



Floating-Point Arithmetic (n-bit input and output) 

Addition, Subtraction                   O(n) 
Multiplication                          M(n) 
Division                                O(M(n)) 
Square Root                             O(M(n)) 
k-th Root                               O(M(n)) 
Base Conversion                         O(M(n) log n) 